
Artificial Intelligence and Machine Learning
Introduction to Neural Networks - Cars4u
Used Car Price Prediction

Problem Statement¶

Business Context¶

There is huge demand for used cars in the Indian market today. As sales of new cars have slowed down in the recent past, the pre-owned car market has continued to grow over the past few years and is now larger than the new car market. Cars4U is a budding tech start-up that aims to gain a foothold in this market.

In 2018-19, while new car sales were recorded at 3.6 million units, around 4 million second-hand cars were bought and sold. The slowdown in new car sales could mean that demand is shifting towards the pre-owned market. In fact, some car owners replace their old cars with pre-owned cars instead of buying new ones. Unlike new cars, whose price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturers), apart from dealership-level discounts that come into play only at the last stage of the customer journey, used cars come with huge uncertainty in both pricing and supply. Keeping this in mind, a sound pricing scheme for these used cars is important for growing in this market.

As a senior data scientist at Cars4U, you have to come up with a pricing model that can effectively predict the price of used cars and can help the business in devising profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.

Objective¶

To explore and visualize the dataset, build a linear regression model to predict the prices of used cars, and generate a set of insights and recommendations that will help the business.

Data Description¶

The data contains the different attributes of used cars sold in different locations. The detailed data dictionary is given below.

  • Brand: brand name of the car
  • Model Name: model name of the car
  • Location: Location in which the car is being sold or is available for purchase (cities)
  • Year: Manufacturing year of the car
  • Kilometers_driven: The total kilometers driven in the car by the previous owner(s) in km
  • Fuel_Type: The type of fuel used by the car (Petrol, Diesel, Electric, CNG, LPG)
  • Transmission: The type of transmission used by the car (Automatic/Manual)
  • Owner_Type: Type of ownership
  • Mileage: The standard mileage offered by the car company in kmpl or km/kg
  • Engine: The displacement volume of the engine in CC
  • Power: The maximum power of the engine in bhp
  • Seats: The number of seats in the car
  • New_Price: The price of a new car of the same model in INR Lakhs (1 Lakh = 100,000 INR)
  • Price: The price of the used car in INR Lakhs
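Since both price columns are reported in INR Lakhs, a tiny helper makes the unit explicit (an illustrative sketch only; the notebook keeps all prices in Lakhs throughout):

```python
def lakh_to_inr(price_in_lakh: float) -> float:
    """Convert a price in INR Lakhs to absolute INR (1 Lakh = 100,000 INR)."""
    return price_in_lakh * 100_000

# e.g. a used car listed at 1.75 Lakhs
print(lakh_to_inr(1.75))  # 175000.0
```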

Installing and Importing necessary libraries¶

In [ ]:
# Installing the libraries with the specified version
!pip install --no-deps tensorflow==2.18.0 scikit-learn==1.3.2 matplotlib==3.8.3 seaborn==0.13.2 numpy==1.26.4 pandas==2.2.2 -q --user --no-warn-script-location
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.

Note:

  • After running the above cell, kindly restart the runtime (for Google Colab) or notebook kernel (for Jupyter Notebook), and run all cells sequentially from the next cell.
  • On executing the above line of code, you might see a warning or error message regarding package dependencies. It can be safely ignored, as the pinned versions above are sufficient to successfully execute the code in this notebook.
In [ ]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
import time

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to split the data into train and test
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import tensorflow as tf #An end-to-end open source machine learning platform
from tensorflow import keras  # High-level neural networks API for deep learning.
from keras import backend   # Abstraction layer for neural network backend engines.
from keras.models import Sequential  # Model for building NN sequentially.
from keras.layers import Dense

# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
In [ ]:
# Set the seed using keras.utils.set_random_seed. This will set:
# 1) `numpy` seed
# 2) backend random seed
# 3) `python` random seed
keras.utils.set_random_seed(812)

# If using TensorFlow, this will make GPU ops as deterministic as possible,
# but it will affect the overall performance, so be mindful of that.
tf.config.experimental.enable_op_determinism()
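The effect of seeding can be illustrated with plain NumPy (a minimal sketch; `keras.utils.set_random_seed` seeds NumPy, the backend, and Python's `random` module in one call):

```python
import numpy as np

np.random.seed(812)
first_draw = np.random.rand(3)

np.random.seed(812)  # re-seeding restarts the generator from the same state
second_draw = np.random.rand(3)

print(np.array_equal(first_draw, second_draw))  # True
```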

Loading the dataset¶

In [ ]:
# uncomment and run the following lines in case Google Colab is being used
# from google.colab import drive
# drive.mount('/content/drive')
In [ ]:
# loading the dataset
data = pd.read_csv("used_cars_data.csv")

Data Overview¶

Displaying the first few rows of the dataset¶

In [ ]:
data.head()
Out[ ]:
Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Seats New_Price Price mileage_num engine_num power_num Brand Model
0 Mumbai 2010 72000.0 CNG Manual First 5.0 5.51 1.75 26.60 998.0 58.16 maruti wagon
1 Pune 2015 41000.0 Diesel Manual First 5.0 16.06 12.50 19.67 1582.0 126.20 hyundai creta
2 Chennai 2011 46000.0 Petrol Manual First 5.0 8.61 4.50 18.20 1199.0 88.70 honda jazz
3 Chennai 2012 87000.0 Diesel Manual First 7.0 11.27 6.00 20.77 1248.0 88.76 maruti ertiga
4 Coimbatore 2013 40670.0 Diesel Automatic Second 5.0 53.14 17.74 15.20 1968.0 140.80 audi a4

Checking the shape of the dataset¶

In [ ]:
# checking shape of the data
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns.")
There are 7252 rows and 14 columns.

Checking 10 random rows of the dataset¶

In [ ]:
# let's view a sample of the data
data.sample(n=10, random_state=1)
Out[ ]:
Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Seats New_Price Price mileage_num engine_num power_num Brand Model
2397 Kolkata 2016 21460.0 Petrol Manual First 5.0 9.470 6.00 17.00 1497.0 121.36 ford ecosport
6218 Kolkata 2013 48000.0 Diesel Manual First 5.0 7.880 NaN 23.40 1248.0 74.00 maruti swift
6737 Mumbai 2015 59500.0 Petrol Manual First 7.0 13.580 NaN 17.30 1497.0 117.30 honda mobilio
3659 Delhi 2015 27000.0 Petrol Automatic First 5.0 9.600 5.95 19.00 1199.0 88.70 honda jazz
4513 Bangalore 2015 19000.0 Diesel Automatic Second 5.0 69.675 38.00 16.36 2179.0 187.70 jaguar xf
599 Coimbatore 2019 40674.0 Diesel Automatic First 7.0 28.050 24.82 11.36 2755.0 171.50 toyota innova
186 Bangalore 2014 37382.0 Diesel Automatic First 5.0 86.970 32.00 13.00 2143.0 201.10 mercedes-benz e-class
305 Kochi 2014 61726.0 Diesel Automatic First 5.0 67.100 20.77 17.68 1968.0 174.33 audi a6
4581 Hyderabad 2013 105000.0 Diesel Automatic First 5.0 44.800 19.00 17.32 1968.0 150.00 audi q3
6616 Delhi 2014 55000.0 Diesel Automatic First 5.0 49.490 NaN 11.78 2143.0 167.62 mercedes-benz new

Observations

In [ ]:
# let's create a copy of the data to avoid any changes to original data
df = data.copy()

Checking the data types of the columns for the dataset¶

In [ ]:
# checking column datatypes and number of non-null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7252 entries, 0 to 7251
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Location           7252 non-null   object 
 1   Year               7252 non-null   int64  
 2   Kilometers_Driven  7251 non-null   float64
 3   Fuel_Type          7252 non-null   object 
 4   Transmission       7252 non-null   object 
 5   Owner_Type         7252 non-null   object 
 6   Seats              7199 non-null   float64
 7   New_Price          7252 non-null   float64
 8   Price              6019 non-null   float64
 9   mileage_num        7169 non-null   float64
 10  engine_num         7206 non-null   float64
 11  power_num          7077 non-null   float64
 12  Brand              7252 non-null   object 
 13  Model              7252 non-null   object 
dtypes: float64(7), int64(1), object(6)
memory usage: 793.3+ KB

Observations

  • 6 columns are of object type and 8 are numeric (7 of float64 type and 1 of int64 type)

Checking for duplicate values¶

In [ ]:
# checking for duplicate values
df.duplicated().sum()
Out[ ]:
2
  • There are two duplicate rows in the data.
  • Let's take a closer look at them.
In [ ]:
df[df.duplicated(keep=False)]
Out[ ]:
Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Seats New_Price Price mileage_num engine_num power_num Brand Model
3623 Hyderabad 2007 52195.0 Petrol Manual First 5.0 4.36 1.75 19.7 796.0 46.3 maruti alto
4781 Hyderabad 2007 52195.0 Petrol Manual First 5.0 4.36 1.75 19.7 796.0 46.3 maruti alto
6940 Kolkata 2017 13000.0 Diesel Manual First 5.0 13.58 NaN 26.0 1498.0 98.6 honda city
7077 Kolkata 2017 13000.0 Diesel Manual First 5.0 13.58 NaN 26.0 1498.0 98.6 honda city

Observations

  • There is a good chance that two cars of the same build were sold in the same location.
  • But it is highly unlikely that both would have exactly the same number of kilometers driven.
  • So, we will drop one row from each duplicate pair.
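Dropping by index works here because we have inspected the pairs; on toy data, pandas' `drop_duplicates(keep="first")` achieves the same outcome without hard-coding row labels (a sketch, not the notebook's code):

```python
import pandas as pd

toy = pd.DataFrame({
    "Brand": ["maruti", "maruti", "honda", "honda"],
    "Kilometers_Driven": [52195.0, 52195.0, 13000.0, 13000.0],
})

# keep="first" retains the first occurrence of each duplicate pair
deduped = toy.drop_duplicates(keep="first")
print(len(deduped))  # 2
```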
In [ ]:
df.drop(4781, inplace=True)
df.drop(6940, inplace=True)
In [ ]:
# checking for duplicate values
df.duplicated().sum()
Out[ ]:
0
  • There are no duplicate values

Checking for missing values¶

In [ ]:
df.isnull().sum()
Out[ ]:
0
Location 0
Year 0
Kilometers_Driven 1
Fuel_Type 0
Transmission 0
Owner_Type 0
Seats 53
New_Price 0
Price 1232
mileage_num 83
engine_num 46
power_num 175
Brand 0
Model 0

  • There are missing values in Kilometers_Driven, Seats, Price, mileage_num, engine_num, and power_num, which can be treated during data pre-processing
  • We will drop the rows where Price, the target variable, is missing before splitting the data into train and test sets

Exploratory Data Analysis (EDA) Summary¶

Note: The EDA section has been covered in detail in the previous case studies. In this case study, we will mainly focus on the model building aspects. We will only be looking at the key observations from EDA. The detailed EDA can be found in the appendix section.

The below functions need to be defined to carry out the Exploratory Data Analysis.

In [ ]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [ ]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(
            data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
        )  # histogram with the specified number of bins
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

Univariate Analysis¶

In [ ]:
# creating a copy of the dataframe
df1 = df.copy()

Price¶

In [ ]:
histogram_boxplot(df1, "Price", kde=True)

Observations

  • This is a highly skewed distribution.

New_Price¶

In [ ]:
histogram_boxplot(df1, "New_Price", kde=True)

Observations

  • This is another highly skewed distribution.

Brand¶

In [ ]:
labeled_barplot(df1, "Brand", perc=True, n=10)
  • Most of the cars in the data belong to Maruti or Hyundai. The price of used cars is lower for budget brands like Maruti, Tata, and Fiat, and higher for premium brands like Porsche, Bentley, and Lamborghini.

Location¶

In [ ]:
labeled_barplot(df1, "Location", perc=True)
  • Hyderabad and Mumbai have the most demand for used cars.

Fuel_Type¶

In [ ]:
labeled_barplot(df1, "Fuel_Type", perc=True)
  • Around 1% of the cars in the dataset do not run on diesel or petrol.

Bivariate Analysis¶

Correlation Check¶

In [ ]:
plt.figure(figsize=(15, 7))
sns.heatmap(
    df1.corr(numeric_only = True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()

Observations

  • Power and Engine are important predictors of used car price, but they are also highly correlated to each other.
  • The price of a new car of the same model seems to be an important predictor of the used car price, which makes sense.
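The Power–Engine multicollinearity noted above can be quantified with a pairwise Pearson matrix; a sketch on made-up numbers (not the notebook's data):

```python
import pandas as pd

toy = pd.DataFrame({
    "engine_num": [998.0, 1582.0, 1199.0, 1968.0, 2755.0],  # displacement in CC
    "power_num": [58.16, 126.20, 88.70, 140.80, 171.50],    # max power in bhp
})

# a coefficient near 1 means the two predictors move together
corr = toy.corr(numeric_only=True)
print(corr.loc["engine_num", "power_num"])
```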

Price vs Location¶

In [ ]:
plt.figure(figsize=(12, 5))
sns.boxplot(x="Location", y="Price", data=df1)
plt.show()
  • The price of used cars has a large IQR in Coimbatore and Bangalore.

Price vs Brand¶

In [ ]:
plt.figure(figsize=(18, 5))
sns.boxplot(x="Brand", y="Price", data=df)
plt.xticks(rotation=90)
plt.show()
  • The price of used cars is lower for budget brands like Maruti, Tata, Fiat, etc.
  • The price of used cars is higher for premium brands like Porsche, Audi, Lamborghini, etc.

Price vs Year¶

In [ ]:
plt.figure(figsize=(18, 5))
sns.boxplot(x="Year", y="Price", data=df1)
plt.show()
  • The price of used cars has increased over the years.

Data Preprocessing¶

Missing Value Treatment¶

  • Let's drop the rows having NaN in the Price column, which is our target column.
In [ ]:
# considering only the data points where price is not missing
df = df[df["Price"].notna()].copy()

# checking for missing values
df.isnull().sum()
Out[ ]:
0
Location 0
Year 0
Kilometers_Driven 1
Fuel_Type 0
Transmission 0
Owner_Type 0
Seats 42
New_Price 0
Price 0
mileage_num 70
engine_num 36
power_num 143
Brand 0
Model 0

Encoding the categorical variables¶

In [ ]:
df.dtypes
Out[ ]:
0
Location object
Year int64
Kilometers_Driven float64
Fuel_Type object
Transmission object
Owner_Type object
Seats float64
New_Price float64
Price float64
mileage_num float64
engine_num float64
power_num float64
Brand object
Model object

In [ ]:
data_car = df[['Brand', 'Model']].copy()
In [ ]:
df = pd.get_dummies(df,
    columns=df.select_dtypes(include=["object","int64"]).columns.tolist(),
    drop_first=True,dtype=int
)
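On toy data, `get_dummies` with `drop_first=True` keeps k−1 indicator columns per categorical feature, dropping the first level alphabetically (a sketch, not the notebook's data):

```python
import pandas as pd

toy = pd.DataFrame({"Transmission": ["Manual", "Automatic", "Manual"]})

# "Automatic" (first alphabetically) becomes the implicit baseline level
encoded = pd.get_dummies(toy, columns=["Transmission"], drop_first=True, dtype=int)
print(encoded["Transmission_Manual"].tolist())  # [1, 0, 1]
```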
In [ ]:
# Adding Brand and Model which is stored in data_car variable
# These will be needed during missing value imputation
df_final = pd.concat([df,data_car], axis=1)
In [ ]:
df_final.shape
Out[ ]:
(6018, 287)
In [ ]:
df_final.head()
Out[ ]:
(output truncated: first 5 rows of the 287-column encoded dataframe — the 7 numeric columns, one-hot indicator columns for Location, Year, Fuel_Type, Transmission, Owner_Type, Brand, and Model, and the retained Brand and Model columns)

Train Test Split¶

In [ ]:
# defining the dependent and independent variables
X = df_final.drop(["Price"], axis=1)
y = df_final["Price"]
In [ ]:
# splitting the data in 80:20 ratio for train and temporary data
x_train, x_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2,random_state=1)
In [ ]:
# splitting the temporary data in 50:50 ratio for validation and test data
x_val,x_test,y_val,y_test = train_test_split(x_temp,y_temp,test_size=0.5,random_state=1)
In [ ]:
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in validation data =", x_val.shape[0])
print("Number of rows in test data =", x_test.shape[0])
Number of rows in train data = 4814
Number of rows in validation data = 602
Number of rows in test data = 602
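These counts follow from the two-step split; scikit-learn rounds the held-out fraction up with a ceiling, so the numbers above can be reproduced arithmetically (a sketch, assuming the 6,018-row dataset that remains after dropping missing prices):

```python
import math

n = 6018
n_temp = math.ceil(n * 0.2)       # rows held out in the first split
n_train = n - n_temp
n_test = math.ceil(n_temp * 0.5)  # second split: half of the held-out rows
n_val = n_temp - n_test

print(n_train, n_val, n_test)  # 4814 602 602
```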

Missing Value Treatment¶

In [ ]:
def print_missing_values_columns(df):
    """
    Filters and prints only the columns from the DataFrame df that contain missing values.

    Parameters:
    - df: DataFrame
        The DataFrame to check for missing values.
    """
    missing_values_columns = df.columns[df.isnull().any()]
    missing_values_counts = df[missing_values_columns].isnull().sum()
    print(missing_values_counts)
In [ ]:
# train data
print_missing_values_columns(x_train)
Kilometers_Driven      1
Seats                 39
mileage_num           59
engine_num            34
power_num            116
dtype: int64
In [ ]:
# validation data
print_missing_values_columns(x_val)
Seats           1
mileage_num     5
power_num      13
dtype: int64
In [ ]:
# test data
print_missing_values_columns(x_test)
Seats           2
mileage_num     6
engine_num      2
power_num      14
dtype: int64

We'll impute these missing values column by column. For Seats, each missing value is filled with the median number of seats for that car's Brand and Model.

In [ ]:
# first, we calculate the median of Seats in the train set grouped by Brand and Model and store in train_grouped_median
train_grouped_median = x_train.groupby(["Brand", "Model"])["Seats"].median()
train_grouped_median
Out[ ]:
                    Seats
Brand      Model
ambassador classic    5.0
audi       a3         5.0
           a4         5.0
           a6         5.0
           a7         5.0
...                   ...
volvo      s60        5.0
           s80        5.0
           v40        5.0
           xc60       5.0
           xc90       7.0

209 rows × 1 columns


Working of the above code

  • It groups the training dataset x_train by the columns Brand and Model
  • Within each group, it selects the Seats column
  • Then, it calculates the median of the Seats column for each group
  • This step effectively creates a mapping of the median number of seats for each unique combination of Brand and Model
In [ ]:
# we will use the calculated median (train_grouped_median) to fill missing values in Seats for corresponding groups in the train set
x_train["Seats"] = x_train.apply(lambda row: row["Seats"] if not pd.isna(row["Seats"]) else train_grouped_median.get((row["Brand"], row["Model"]), np.nan), axis=1)

Working of the above code

For each row in the training dataset x_train:

  • It checks if the value in the selected row of the Seats column (row["Seats"]) is not NaN (pd.isna(row["Seats"]))

  • If the value is not NaN (i.e., it's not missing), it keeps the original value (row["Seats"])

  • If the value is NaN (missing), it uses train_grouped_median.get((row["Brand"], row["Model"]), np.nan) to fetch the median value for the corresponding Brand and Model combination from the train_grouped_median mapping created previously

    • If there's no corresponding median value (i.e., the combination of Brand and Model doesn't exist in train_grouped_median), it assigns NaN (np.nan).

This step essentially fills missing values in the Seats column of the training dataset x_train using the median values calculated from the training dataset. It ensures that the imputation is done based on the specific Brand and Model combination, preserving the relationship between these features and the Seats column.
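The row-wise apply can also be written with `groupby(...).transform`, which fills each NaN with its group median in one vectorized step (a sketch on toy data; note that transform computes medians within the frame it is called on, so the apply-with-train-medians approach is still needed to impute the validation and test sets without leakage):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "Brand": ["maruti", "maruti", "audi", "audi"],
    "Model": ["alto", "alto", "a4", "a4"],
    "Seats": [5.0, np.nan, 5.0, np.nan],
})

# each NaN is replaced by the median Seats of its (Brand, Model) group
train["Seats"] = train["Seats"].fillna(
    train.groupby(["Brand", "Model"])["Seats"].transform("median")
)
print(train["Seats"].tolist())  # [5.0, 5.0, 5.0, 5.0]
```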

In [ ]:
# checking data points where Seats is still missing
x_train[x_train["Seats"].isnull()]
Out[ ]:
(output: 2 rows, indices 2369 and 5893 — both Maruti Estilo, with Seats and power_num missing; the 280+ one-hot columns are omitted here)
  • Maruti Estilo can accommodate 5 people.
In [ ]:
x_train["Seats"] = x_train["Seats"].fillna(5.0)
In [ ]:
# we will use the calculated median (train_grouped_median) to fill missing values in Seats for corresponding groups in the validation set
x_val["Seats"] = x_val.apply(lambda row: row["Seats"] if not pd.isna(row["Seats"]) else train_grouped_median.get((row["Brand"], row["Model"]), np.nan), axis=1)
  • The above code does the same operation as the one previously used for imputing missing values
  • The only difference is that it operates on the validation set (x_val) instead of the training set (x_train)
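As a side note, the row-wise `apply` used above can also be expressed as a vectorised lookup-and-fill. The sketch below uses toy values (not the notebook's actual frames) to show the equivalent behavior: seen `(Brand, Model)` pairs are filled with the train median, unseen pairs stay `NaN`.

```python
import numpy as np
import pandas as pd

# toy train/validation frames; values are illustrative only
x_tr = pd.DataFrame({"Brand": ["maruti", "bmw"], "Model": ["estilo", "x1"], "Seats": [5.0, 5.0]})
x_v = pd.DataFrame({"Brand": ["maruti", "audi"], "Model": ["estilo", "q3"], "Seats": [np.nan, np.nan]})

# medians computed on the train split only, keyed by (Brand, Model)
train_grouped_median = x_tr.groupby(["Brand", "Model"])["Seats"].median()

# look up each validation row's (Brand, Model) key in the train medians, then fill
keys = pd.MultiIndex.from_frame(x_v[["Brand", "Model"]])
fill = pd.Series(train_grouped_median.reindex(keys).to_numpy(), index=x_v.index)
x_v["Seats"] = x_v["Seats"].fillna(fill)
# seen pair (maruti, estilo) is filled with 5.0; unseen (audi, q3) stays NaN
```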
In [ ]:
# checking the missing values in x_val
print_missing_values_columns(x_val)
Seats           1
mileage_num     5
power_num      13
dtype: int64
In [ ]:
# checking data points where Seats is still missing
x_val[x_val["Seats"].isnull()]
Out[ ]:
Kilometers_Driven Seats New_Price mileage_num engine_num power_num Location_Bangalore Location_Chennai Location_Coimbatore Location_Delhi Location_Hyderabad Location_Jaipur Location_Kochi Location_Kolkata Location_Mumbai Location_Pune Year_1999 Year_2000 Year_2001 Year_2002 Year_2003 Year_2004 Year_2005 Year_2006 Year_2007 Year_2008 Year_2009 Year_2010 Year_2011 Year_2012 Year_2013 Year_2014 Year_2015 Year_2016 Year_2017 Year_2018 Year_2019 Fuel_Type_Diesel Fuel_Type_Electric Fuel_Type_LPG Fuel_Type_Petrol Transmission_Manual Owner_Type_Fourth & Above Owner_Type_Second Owner_Type_Third Brand_audi Brand_bentley Brand_bmw Brand_chevrolet Brand_datsun Brand_fiat Brand_force Brand_ford Brand_honda Brand_hyundai Brand_isuzu Brand_jaguar Brand_jeep Brand_lamborghini Brand_land Brand_mahindra Brand_maruti Brand_mercedes-benz Brand_mini Brand_mitsubishi Brand_nissan Brand_porsche Brand_renault Brand_skoda Brand_smart Brand_tata Brand_toyota Brand_volkswagen Brand_volvo Model_1000 Model_3 Model_5 Model_6 Model_7 Model_800 Model_a Model_a-star Model_a3 Model_a4 Model_a6 Model_a7 Model_a8 Model_accent Model_accord Model_alto Model_amaze Model_ameo Model_aspire Model_aveo Model_avventura Model_b Model_baleno Model_beat Model_beetle Model_bolero Model_bolt Model_boxster Model_br-v Model_brio Model_brv Model_c-class Model_camry Model_captiva Model_captur Model_cayenne Model_cayman Model_cedia Model_celerio Model_ciaz Model_city Model_civic Model_cla Model_classic Model_cls-class Model_clubman Model_compass Model_continental Model_cooper Model_corolla Model_countryman Model_cr-v Model_creta Model_crosspolo Model_cruze Model_d-max Model_duster Model_dzire Model_e Model_e-class Model_ecosport Model_eeco Model_elantra Model_elite Model_endeavour Model_enjoy Model_eon Model_ertiga Model_esteem Model_estilo Model_etios Model_evalia Model_f Model_fabia Model_fiesta Model_figo Model_fluence Model_fortuner Model_fortwo Model_freestyle Model_fusion Model_gallardo Model_getz Model_gl-class 
Model_gla Model_glc Model_gle Model_gls Model_go Model_grand Model_grande Model_hexa Model_i10 Model_i20 Model_ignis Model_ikon Model_indica Model_indigo Model_innova Model_jazz Model_jeep Model_jetta Model_koleos Model_kuv Model_kwid Model_lancer Model_laura Model_linea Model_lodgy Model_logan Model_m-class Model_manza Model_micra Model_mobilio Model_montero Model_mustang Model_mux Model_nano Model_new Model_nexon Model_nuvosport Model_octavia Model_omni Model_one Model_optra Model_outlander Model_pajero Model_panamera Model_passat Model_petra Model_platinum Model_polo Model_prius Model_pulse Model_punto Model_q3 Model_q5 Model_q7 Model_qualis Model_quanto Model_r-class Model_rapid Model_redi Model_redi-go Model_renault Model_ritz Model_rover Model_rs5 Model_s Model_s-class Model_s-cross Model_s60 Model_s80 Model_safari Model_sail Model_santa Model_santro Model_scala Model_scorpio Model_siena Model_sl-class Model_slc Model_slk-class Model_sonata Model_spark Model_ssangyong Model_sumo Model_sunny Model_superb Model_swift Model_sx4 Model_tavera Model_teana Model_terrano Model_thar Model_tiago Model_tigor Model_tiguan Model_tt Model_tucson Model_tuv Model_v40 Model_vento Model_venture Model_verito Model_verna Model_versa Model_vitara Model_wagon Model_wr-v Model_wrv Model_x-trail Model_x1 Model_x3 Model_x5 Model_x6 Model_xc60 Model_xc90 Model_xcent Model_xe Model_xenon Model_xf Model_xj Model_xuv300 Model_xuv500 Model_xylo Model_yeti Model_z4 Model_zen Model_zest Brand Model
3882 40000.0 NaN 7.88 19.5 1061.0 NaN 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 maruti estilo
  • Maruti Estilo can accommodate 5 people.
In [ ]:
x_val["Seats"] = x_val["Seats"].fillna(5.0)
In [ ]:
# checking the missing values in x_val
print_missing_values_columns(x_val)
mileage_num     5
power_num      13
dtype: int64
In [ ]:
# The same method is applied to the test data
x_test["Seats"] = x_test.apply(lambda row: row["Seats"] if not pd.isna(row["Seats"]) else train_grouped_median.get((row["Brand"], row["Model"]), np.nan), axis=1)
In [ ]:
# checking the missing values in x_test
print_missing_values_columns(x_test)
mileage_num     6
engine_num      2
power_num      14
dtype: int64

We will use a similar method to fill missing values for the Kilometers_Driven, mileage_num, engine_num, and power_num columns.

In [ ]:
cols_list = ["Kilometers_Driven","mileage_num", "engine_num", "power_num"]

# Step 1: Calculate the median of specified columns in x_train grouped by Brand and Model
train_grouped_median = x_train.groupby(["Brand", "Model"])[cols_list].median()

# Step 2: Use the calculated median to fill missing values in specified columns for corresponding groups in train, validation and test data
for col in cols_list:
    x_train[col] = x_train.apply(lambda row: row[col] if not pd.isna(row[col]) else train_grouped_median[col].get((row["Brand"], row["Model"]), np.nan), axis=1)
    x_val[col] = x_val.apply(lambda row: row[col] if not pd.isna(row[col]) else train_grouped_median[col].get((row["Brand"], row["Model"]), np.nan), axis=1)
    x_test[col] = x_test.apply(lambda row: row[col] if not pd.isna(row[col]) else train_grouped_median[col].get((row["Brand"], row["Model"]), np.nan), axis=1)
In [ ]:
# checking the missing values in x_train
print_missing_values_columns(x_train)
mileage_num    7
power_num      9
dtype: int64
In [ ]:
# checking the missing values in x_val
print_missing_values_columns(x_val)
mileage_num    1
power_num      1
dtype: int64
In [ ]:
# checking the missing values in x_test
print_missing_values_columns(x_test)
mileage_num    1
power_num      1
dtype: int64
  • There are still some missing values in mileage_num and power_num.
  • We'll impute these missing values by taking the median grouped by the Brand.
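This brand-level fallback can be sketched on toy values (not the notebook's data) as follows; `fillna` with a per-row mapped median behaves the same as the row-wise `apply` used in the next cell:

```python
import numpy as np
import pandas as pd

# toy frame: one mileage value is missing for a brand that has other known rows
df_toy = pd.DataFrame({"Brand": ["maruti", "maruti", "bmw"],
                       "mileage_num": [20.0, np.nan, 12.0]})

brand_median = df_toy.groupby("Brand")["mileage_num"].median()
# map each row's Brand to its median, then fill the gaps
df_toy["mileage_num"] = df_toy["mileage_num"].fillna(df_toy["Brand"].map(brand_median))
# the missing maruti row is filled with the maruti median (20.0)
```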
In [ ]:
cols_list = ["mileage_num", "power_num"]

# Step 1: Calculate the median of specified columns in x_train grouped by Brand
train_grouped_median = x_train.groupby(["Brand"])[cols_list].median()

# Step 2: Use the calculated median to fill missing values in specified columns for corresponding groups in train, validation and test data
for col in cols_list:
    x_train[col] = x_train.apply(lambda row: row[col] if not pd.isna(row[col]) else train_grouped_median[col].get((row["Brand"]), np.nan), axis=1)
    x_val[col] = x_val.apply(lambda row: row[col] if not pd.isna(row[col]) else train_grouped_median[col].get((row["Brand"]), np.nan), axis=1)
    x_test[col] = x_test.apply(lambda row: row[col] if not pd.isna(row[col]) else train_grouped_median[col].get((row["Brand"]), np.nan), axis=1)
In [ ]:
print_missing_values_columns(x_train)
mileage_num    1
power_num      1
dtype: int64
In [ ]:
print_missing_values_columns(x_val)
Series([], dtype: float64)
In [ ]:
print_missing_values_columns(x_test)
Series([], dtype: float64)
  • There are still some missing values in the train data (mileage_num and power_num), while all missing values in the validation and test data have been imputed.
  • We'll impute the remaining train missing values using the column medians computed across the entire dataset.
In [ ]:
cols_list = ["mileage_num", "power_num"]

for col in cols_list:
    x_train[col] = x_train[col].fillna(df[col].median())
In [ ]:
print_missing_values_columns(x_train)
Series([], dtype: float64)
  • Missing values in all columns of x_train are imputed.
In [ ]:
# Dropping Brand and Model from train, validation, and test data as we already have dummy variables for them
x_train = x_train.drop(['Brand','Model'],axis=1)
x_val = x_val.drop(['Brand','Model'],axis=1)
x_test = x_test.drop(['Brand','Model'],axis=1)

Normalizing the numerical variables¶

In [ ]:
# Define the columns to scale
num_columns = ["Kilometers_Driven", "Seats", "New_Price", "mileage_num", "engine_num", "power_num"]

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the selected columns in the x_train data
scaler.fit(x_train[num_columns])
Out[ ]:
StandardScaler()
  • Once the scaler is fit on the data using the fit() method, it stores the normalization parameters (mean and standard deviation) computed from the training data

  • We then use these parameters to normalize the validation and test data

  • This is similar to what we did in the Missing Value Treatment section.

    • The only difference is that there we had to explicitly store the parameters (median values), while here sklearn stores them implicitly
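A minimal sketch of this fit/transform behavior, using randomly generated stand-in data rather than the notebook's frames: `transform` on the validation set applies the mean and standard deviation stored from the training set, which is exactly `(x - mean_) / scale_`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(10.0, 2.0, size=(100, 3))   # stand-in for x_train[num_columns]
val = rng.normal(10.0, 2.0, size=(20, 3))      # stand-in for x_val[num_columns]

scaler = StandardScaler().fit(train)  # stores per-column mean_ and scale_ from train
val_scaled = scaler.transform(val)    # applies the *train* parameters to val

# transform is equivalent to manual normalization with the stored parameters
manual = (val - scaler.mean_) / scaler.scale_
```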
In [ ]:
# Transform selected columns in x_train, x_val, and x_test using the fitted scaler

x_train[num_columns] = scaler.transform(x_train[num_columns])

x_val[num_columns] = scaler.transform(x_val[num_columns])

x_test[num_columns] = scaler.transform(x_test[num_columns])
In [ ]:
x_train.head()
Out[ ]:
Kilometers_Driven Seats New_Price mileage_num engine_num power_num Location_Bangalore Location_Chennai Location_Coimbatore Location_Delhi Location_Hyderabad Location_Jaipur Location_Kochi Location_Kolkata Location_Mumbai Location_Pune Year_1999 Year_2000 Year_2001 Year_2002 Year_2003 Year_2004 Year_2005 Year_2006 Year_2007 Year_2008 Year_2009 Year_2010 Year_2011 Year_2012 Year_2013 Year_2014 Year_2015 Year_2016 Year_2017 Year_2018 Year_2019 Fuel_Type_Diesel Fuel_Type_Electric Fuel_Type_LPG Fuel_Type_Petrol Transmission_Manual Owner_Type_Fourth & Above Owner_Type_Second Owner_Type_Third Brand_audi Brand_bentley Brand_bmw Brand_chevrolet Brand_datsun Brand_fiat Brand_force Brand_ford Brand_honda Brand_hyundai Brand_isuzu Brand_jaguar Brand_jeep Brand_lamborghini Brand_land Brand_mahindra Brand_maruti Brand_mercedes-benz Brand_mini Brand_mitsubishi Brand_nissan Brand_porsche Brand_renault Brand_skoda Brand_smart Brand_tata Brand_toyota Brand_volkswagen Brand_volvo Model_1000 Model_3 Model_5 Model_6 Model_7 Model_800 Model_a Model_a-star Model_a3 Model_a4 Model_a6 Model_a7 Model_a8 Model_accent Model_accord Model_alto Model_amaze Model_ameo Model_aspire Model_aveo Model_avventura Model_b Model_baleno Model_beat Model_beetle Model_bolero Model_bolt Model_boxster Model_br-v Model_brio Model_brv Model_c-class Model_camry Model_captiva Model_captur Model_cayenne Model_cayman Model_cedia Model_celerio Model_ciaz Model_city Model_civic Model_cla Model_classic Model_cls-class Model_clubman Model_compass Model_continental Model_cooper Model_corolla Model_countryman Model_cr-v Model_creta Model_crosspolo Model_cruze Model_d-max Model_duster Model_dzire Model_e Model_e-class Model_ecosport Model_eeco Model_elantra Model_elite Model_endeavour Model_enjoy Model_eon Model_ertiga Model_esteem Model_estilo Model_etios Model_evalia Model_f Model_fabia Model_fiesta Model_figo Model_fluence Model_fortuner Model_fortwo Model_freestyle Model_fusion Model_gallardo Model_getz Model_gl-class 
Model_gla Model_glc Model_gle Model_gls Model_go Model_grand Model_grande Model_hexa Model_i10 Model_i20 Model_ignis Model_ikon Model_indica Model_indigo Model_innova Model_jazz Model_jeep Model_jetta Model_koleos Model_kuv Model_kwid Model_lancer Model_laura Model_linea Model_lodgy Model_logan Model_m-class Model_manza Model_micra Model_mobilio Model_montero Model_mustang Model_mux Model_nano Model_new Model_nexon Model_nuvosport Model_octavia Model_omni Model_one Model_optra Model_outlander Model_pajero Model_panamera Model_passat Model_petra Model_platinum Model_polo Model_prius Model_pulse Model_punto Model_q3 Model_q5 Model_q7 Model_qualis Model_quanto Model_r-class Model_rapid Model_redi Model_redi-go Model_renault Model_ritz Model_rover Model_rs5 Model_s Model_s-class Model_s-cross Model_s60 Model_s80 Model_safari Model_sail Model_santa Model_santro Model_scala Model_scorpio Model_siena Model_sl-class Model_slc Model_slk-class Model_sonata Model_spark Model_ssangyong Model_sumo Model_sunny Model_superb Model_swift Model_sx4 Model_tavera Model_teana Model_terrano Model_thar Model_tiago Model_tigor Model_tiguan Model_tt Model_tucson Model_tuv Model_v40 Model_vento Model_venture Model_verito Model_verna Model_versa Model_vitara Model_wagon Model_wr-v Model_wrv Model_x-trail Model_x1 Model_x3 Model_x5 Model_x6 Model_xc60 Model_xc90 Model_xcent Model_xe Model_xenon Model_xf Model_xj Model_xuv300 Model_xuv500 Model_xylo Model_yeti Model_z4 Model_zen Model_zest
4269 -0.694078 -0.351313 -0.637638 1.136662 -1.034356 -0.841807 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2025 -0.081329 2.126668 -0.674075 -0.765611 -0.708133 -0.731916 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5776 -0.469629 -0.351313 1.297640 -0.287665 0.563805 1.136412 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1710 -0.365282 -0.351313 -0.517681 0.732429 -0.706486 -0.545692 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2363 -0.978527 -0.351313 -0.572951 0.137969 -0.706486 -0.565973 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Utility functions¶

In [ ]:
def plot(history, name):
    """
    Function to plot loss/accuracy

    history: an object which stores the metrics and losses.
    name: can be one of Loss or Accuracy
    """
    fig, ax = plt.subplots() #Creating a subplot with figure and axes.
    plt.plot(history.history[name]) #Plotting the train accuracy or train loss
    plt.plot(history.history['val_'+name]) #Plotting the validation accuracy or validation loss

    plt.title('Model ' + name.capitalize()) #Defining the title of the plot.
    plt.ylabel(name.capitalize()) #Capitalizing the first letter.
    plt.xlabel('Epoch') #Defining the label for the x-axis.
    fig.legend(['Train', 'Validation'], loc="outside right upper") #Defining the legend, loc controls the position of the legend.

We'll create a dataframe to store the results from all the models we build

  • We will be using metric functions defined in sklearn for RMSE, MAE, and $R^2$.
  • We will define a function to calculate MAPE and adjusted $R^2$.
  • We will create a function which will print out all the above metrics in one go.
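For reference, the two hand-defined metrics below correspond to:

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

where $n$ is the number of observations and $k$ is the number of predictors, and

$$\text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$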
In [ ]:
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# function to compute MAPE
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100


# function to compute different metrics to check performance of a neural network model
def model_performance(model,predictors,target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors).reshape(-1)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mape_score(target, pred)  # to compute MAPE

    # creating a dictionary of metrics
    df_perf =  {
            "RMSE": [rmse],
            "MAE": [mae],
            "R-squared": [r2],
            "Adj. R-squared": [adjr2],
            "MAPE": [mape]}

    return df_perf

columns = ["# hidden layers","# neurons - hidden layer","activation function - hidden layer","# epochs","batch size","optimizer","time(secs)","Train_loss","Valid_loss","Train_R-squared","Valid_R-squared"]

results = pd.DataFrame(columns=columns)

Model building¶

We'll use $R^2$ as our metric of choice for the model to optimize.

In [ ]:
#Defining the list of metrics to be used for all the models.
metrics = [tf.keras.metrics.R2Score(name="r2_score")]

Model 0¶

  • We will start off with a simple neural network with
    • No hidden layers
    • Gradient descent as the optimization algorithm.
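With no hidden layers, this network is a single Dense(1) layer, i.e. plain linear regression: one weight per input feature plus a bias. A quick sanity check on the parameter count (assuming 284 input columns after one-hot encoding, as in this data):

```python
# a Dense(1) layer on d inputs has d weights + 1 bias
d = 284                  # number of columns in x_train after encoding (assumption)
n_params = d + 1
# model.summary() below reports 285 trainable parameters
```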
In [ ]:
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
In [ ]:
#Initializing the neural network
model = Sequential()
model.add(Dense(1,input_dim=x_train.shape[1]))
In [ ]:
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 1)              │           285 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 285 (1.11 KB)
 Trainable params: 285 (1.11 KB)
 Non-trainable params: 0 (0.00 B)
In [ ]:
optimizer = keras.optimizers.SGD()    # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics,run_eagerly=True)
In [ ]:
epochs = 10
batch_size = x_train.shape[0]
In [ ]:
start = time.time()
history = model.fit(x_train, y_train, validation_data=(x_val,y_val) , batch_size=batch_size, epochs=epochs)
end=time.time()
Epoch 1/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 426ms/step - loss: 215.6193 - r2_score: -0.7024 - val_loss: 231.8262 - val_r2_score: -0.6387
Epoch 2/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 517ms/step - loss: 199.6886 - r2_score: -0.5766 - val_loss: 215.1222 - val_r2_score: -0.5207
Epoch 3/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 219ms/step - loss: 185.3213 - r2_score: -0.4632 - val_loss: 200.0059 - val_r2_score: -0.4138
Epoch 4/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 302ms/step - loss: 172.3545 - r2_score: -0.3608 - val_loss: 186.3157 - val_r2_score: -0.3170
Epoch 5/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 288ms/step - loss: 160.6429 - r2_score: -0.2683 - val_loss: 173.9070 - val_r2_score: -0.2293
Epoch 6/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 229ms/step - loss: 150.0572 - r2_score: -0.1848 - val_loss: 162.6510 - val_r2_score: -0.1498
Epoch 7/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 136ms/step - loss: 140.4819 - r2_score: -0.1092 - val_loss: 152.4325 - val_r2_score: -0.0775
Epoch 8/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 153ms/step - loss: 131.8141 - r2_score: -0.0407 - val_loss: 143.1481 - val_r2_score: -0.0119
Epoch 9/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 268ms/step - loss: 123.9616 - r2_score: 0.0213 - val_loss: 134.7058 - val_r2_score: 0.0478
Epoch 10/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 134ms/step - loss: 116.8424 - r2_score: 0.0775 - val_loss: 127.0229 - val_r2_score: 0.1021
In [ ]:
print("Time taken in seconds ",end-start)
Time taken in seconds  2.8553929328918457
In [ ]:
plot(history,'loss')
No description has been provided for this image
In [ ]:
plot(history,'r2_score')
No description has been provided for this image
In [ ]:
results.loc[0]=['-','-','-',epochs,batch_size,'GD',(end-start),history.history["loss"][-1],history.history["val_loss"][-1],history.history["r2_score"][-1],history.history["val_r2_score"][-1]]
In [ ]:
results
Out[ ]:
# hidden layers # neurons - hidden layer activation function - hidden layer # epochs batch size optimizer time(secs) Train_loss Valid_loss Train_R-squared Valid_R-squared
0 - - - 10 4814 GD 2.855393 116.842415 127.022896 0.077489 0.102093
  • Since it's a very simple neural network, the scores aren't good.

Model 1¶

  • Let's try increasing the epochs to check whether the performance is improving or not.
In [ ]:
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
In [ ]:
#Initializing the neural network
model = Sequential()
model.add(Dense(1,input_dim=x_train.shape[1]))
In [ ]:
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 1)              │           285 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 285 (1.11 KB)
 Trainable params: 285 (1.11 KB)
 Non-trainable params: 0 (0.00 B)
In [ ]:
optimizer = keras.optimizers.SGD()    # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics,run_eagerly=True)
In [ ]:
epochs = 25
batch_size = x_train.shape[0]
In [ ]:
start = time.time()
history = model.fit(x_train, y_train, validation_data=(x_val,y_val) , batch_size=batch_size, epochs=epochs)
end=time.time()
Epoch 1/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 268ms/step - loss: 212.9668 - r2_score: -0.5842 - val_loss: 229.4064 - val_r2_score: -0.6216
Epoch 2/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 167ms/step - loss: 197.4542 - r2_score: -0.5590 - val_loss: 213.0991 - val_r2_score: -0.5064
Epoch 3/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 159ms/step - loss: 183.4492 - r2_score: -0.4484 - val_loss: 198.3264 - val_r2_score: -0.4019
Epoch 4/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 123ms/step - loss: 170.7959 - r2_score: -0.3485 - val_loss: 184.9332 - val_r2_score: -0.3073
Epoch 5/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 163ms/step - loss: 159.3555 - r2_score: -0.2582 - val_loss: 172.7811 - val_r2_score: -0.2214
Epoch 6/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 310ms/step - loss: 149.0038 - r2_score: -0.1764 - val_loss: 161.7463 - val_r2_score: -0.1434
Epoch 7/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 125ms/step - loss: 139.6302 - r2_score: -0.1024 - val_loss: 151.7180 - val_r2_score: -0.0725
Epoch 8/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 140ms/step - loss: 131.1358 - r2_score: -0.0354 - val_loss: 142.5970 - val_r2_score: -0.0080
Epoch 9/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 158ms/step - loss: 123.4324 - r2_score: 0.0255 - val_loss: 134.2945 - val_r2_score: 0.0507
Epoch 10/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 127ms/step - loss: 116.4407 - r2_score: 0.0807 - val_loss: 126.7310 - val_r2_score: 0.1042
Epoch 11/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 163ms/step - loss: 110.0903 - r2_score: 0.1308 - val_loss: 119.8349 - val_r2_score: 0.1529
Epoch 12/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 284ms/step - loss: 104.3176 - r2_score: 0.1764 - val_loss: 113.5422 - val_r2_score: 0.1974
Epoch 13/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 150ms/step - loss: 99.0660 - r2_score: 0.2178 - val_loss: 107.7954 - val_r2_score: 0.2380
Epoch 14/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 268ms/step - loss: 94.2847 - r2_score: 0.2556 - val_loss: 102.5428 - val_r2_score: 0.2751
Epoch 15/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 139ms/step - loss: 89.9281 - r2_score: 0.2900 - val_loss: 97.7379 - val_r2_score: 0.3091
Epoch 16/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 142ms/step - loss: 85.9551 - r2_score: 0.3214 - val_loss: 93.3388 - val_r2_score: 0.3402
Epoch 17/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 154ms/step - loss: 82.3291 - r2_score: 0.3500 - val_loss: 89.3079 - val_r2_score: 0.3687
Epoch 18/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 296ms/step - loss: 79.0171 - r2_score: 0.3761 - val_loss: 85.6112 - val_r2_score: 0.3948
Epoch 19/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 270ms/step - loss: 75.9892 - r2_score: 0.4000 - val_loss: 82.2183 - val_r2_score: 0.4188
Epoch 20/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 143ms/step - loss: 73.2189 - r2_score: 0.4219 - val_loss: 79.1014 - val_r2_score: 0.4408
Epoch 21/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 134ms/step - loss: 70.6820 - r2_score: 0.4419 - val_loss: 76.2357 - val_r2_score: 0.4611
Epoch 22/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 126ms/step - loss: 68.3569 - r2_score: 0.4603 - val_loss: 73.5987 - val_r2_score: 0.4797
Epoch 23/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 148ms/step - loss: 66.2241 - r2_score: 0.4771 - val_loss: 71.1701 - val_r2_score: 0.4969
Epoch 24/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 306ms/step - loss: 64.2660 - r2_score: 0.4926 - val_loss: 68.9315 - val_r2_score: 0.5127
Epoch 25/25
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 280ms/step - loss: 62.4666 - r2_score: 0.5068 - val_loss: 66.8663 - val_r2_score: 0.5273
In [ ]:
print("Time taken in seconds ",end-start)
Time taken in seconds  4.981976270675659
In [ ]:
plot(history,'loss')
No description has been provided for this image
In [ ]:
plot(history,'r2_score')
No description has been provided for this image
In [ ]:
results.loc[1]=['-','-','-',epochs,batch_size,'GD',(end-start),history.history["loss"][-1],history.history["val_loss"][-1],history.history["r2_score"][-1],history.history["val_r2_score"][-1]]
In [ ]:
results
Out[ ]:
# hidden layers # neurons - hidden layer activation function - hidden layer # epochs batch size optimizer time(secs) Train_loss Valid_loss Train_R-squared Valid_R-squared
0 - - - 10 4814 GD 2.855393 116.842415 127.022896 0.077489 0.102093
1 - - - 25 4814 GD 4.981976 62.466640 66.866333 0.506804 0.527331
  • As expected, we see an increase in the $R^2$, which is great.

Model 2¶

  • Even though the previous model's performance was good, the improvement in scores from one epoch to the next is small because, with the batch size equal to the full training set, the weights are updated only once per epoch.
  • Let's now switch to mini-batch stochastic gradient descent (a batch size of 32) so that the weights are updated many times per epoch.
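With 4,814 training rows (the batch size used in the previous models) and a mini-batch size of 32, each epoch now performs roughly 151 weight updates instead of one, which matches the 151/151 progress counter in the training logs below:

```python
import math

n_samples, batch_size = 4814, 32   # train size (from the results table) and the new batch size
steps_per_epoch = math.ceil(n_samples / batch_size)
# 4814 / 32 = 150.4375 -> 151 gradient updates per epoch
```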
In [ ]:
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
In [ ]:
#Initializing the neural network
model = Sequential()
model.add(Dense(1,input_dim=x_train.shape[1]))
In [ ]:
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 1)              │           285 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 285 (1.11 KB)
 Trainable params: 285 (1.11 KB)
 Non-trainable params: 0 (0.00 B)
In [ ]:
optimizer = keras.optimizers.SGD()    # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics,run_eagerly=True)
In [ ]:
epochs = 25
batch_size = 32
In [ ]:
start = time.time()
history = model.fit(x_train, y_train, validation_data=(x_val,y_val) , batch_size=batch_size, epochs=epochs)
end=time.time()
Epoch 1/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 5s 30ms/step - loss: 86.8465 - r2_score: 0.4432 - val_loss: 35.5213 - val_r2_score: 0.7489
Epoch 2/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - loss: 33.6671 - r2_score: 0.7365 - val_loss: 32.9409 - val_r2_score: 0.7671
Epoch 3/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 23ms/step - loss: 31.4131 - r2_score: 0.7541 - val_loss: 31.3458 - val_r2_score: 0.7784
Epoch 4/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - loss: 30.0467 - r2_score: 0.7648 - val_loss: 30.2675 - val_r2_score: 0.7860
Epoch 5/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - loss: 29.1110 - r2_score: 0.7721 - val_loss: 29.4811 - val_r2_score: 0.7916
Epoch 6/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 5s 23ms/step - loss: 28.4153 - r2_score: 0.7775 - val_loss: 28.8686 - val_r2_score: 0.7959
Epoch 7/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 6s 29ms/step - loss: 27.8644 - r2_score: 0.7818 - val_loss: 28.3658 - val_r2_score: 0.7995
Epoch 8/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 23ms/step - loss: 27.4077 - r2_score: 0.7854 - val_loss: 27.9365 - val_r2_score: 0.8025
Epoch 9/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 6s 28ms/step - loss: 27.0161 - r2_score: 0.7884 - val_loss: 27.5594 - val_r2_score: 0.8052
Epoch 10/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - loss: 26.6721 - r2_score: 0.7911 - val_loss: 27.2215 - val_r2_score: 0.8076
Epoch 11/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 5s 23ms/step - loss: 26.3648 - r2_score: 0.7935 - val_loss: 26.9147 - val_r2_score: 0.8097
Epoch 12/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 29ms/step - loss: 26.0867 - r2_score: 0.7957 - val_loss: 26.6333 - val_r2_score: 0.8117
Epoch 13/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - loss: 25.8324 - r2_score: 0.7976 - val_loss: 26.3735 - val_r2_score: 0.8136
Epoch 14/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - loss: 25.5982 - r2_score: 0.7995 - val_loss: 26.1323 - val_r2_score: 0.8153
Epoch 15/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 6s 30ms/step - loss: 25.3811 - r2_score: 0.8012 - val_loss: 25.9075 - val_r2_score: 0.8169
Epoch 16/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 27ms/step - loss: 25.1790 - r2_score: 0.8027 - val_loss: 25.6974 - val_r2_score: 0.8183
Epoch 17/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 5s 27ms/step - loss: 24.9900 - r2_score: 0.8042 - val_loss: 25.5004 - val_r2_score: 0.8197
Epoch 18/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 5s 24ms/step - loss: 24.8125 - r2_score: 0.8056 - val_loss: 25.3153 - val_r2_score: 0.8210
Epoch 19/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - loss: 24.6455 - r2_score: 0.8069 - val_loss: 25.1410 - val_r2_score: 0.8223
Epoch 20/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 29ms/step - loss: 24.4877 - r2_score: 0.8081 - val_loss: 24.9766 - val_r2_score: 0.8234
Epoch 21/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 23ms/step - loss: 24.3385 - r2_score: 0.8093 - val_loss: 24.8212 - val_r2_score: 0.8245
Epoch 22/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 5s 23ms/step - loss: 24.1969 - r2_score: 0.8104 - val_loss: 24.6741 - val_r2_score: 0.8256
Epoch 23/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 6s 28ms/step - loss: 24.0624 - r2_score: 0.8114 - val_loss: 24.5347 - val_r2_score: 0.8266
Epoch 24/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 4s 24ms/step - loss: 23.9343 - r2_score: 0.8124 - val_loss: 24.4023 - val_r2_score: 0.8275
Epoch 25/25
151/151 ━━━━━━━━━━━━━━━━━━━━ 6s 29ms/step - loss: 23.8121 - r2_score: 0.8134 - val_loss: 24.2764 - val_r2_score: 0.8284
In [ ]:
print("Time taken in seconds ",end-start)
Time taken in seconds  116.03793096542358
In [ ]:
plot(history, 'loss')
[Plot: loss over epochs]
In [ ]:
plot(history, 'r2_score')
[Plot: R-squared over epochs]
In [ ]:
results.loc[2]=['-','-','-',epochs,batch_size,'SGD',(end-start),history.history["loss"][-1],history.history["val_loss"][-1],history.history["r2_score"][-1],history.history["val_r2_score"][-1]]
In [ ]:
results
Out[ ]:
# hidden layers # neurons - hidden layer activation function - hidden layer # epochs batch size optimizer time(secs) Train_loss Valid_loss Train_R-squared Valid_R-squared
0 - - - 10 4814 GD 2.855393 116.842415 127.022896 0.077489 0.102093
1 - - - 25 4814 GD 4.981976 62.466640 66.866333 0.506804 0.527331
2 - - - 25 32 SGD 116.037931 25.865023 24.276419 0.795787 0.828393
  • After just one epoch, the validation $R^2$ is already ~0.75, a big jump over full-batch gradient descent.
  • The $R^2$ also keeps improving steadily with each subsequent epoch.
  • Note that the time taken to train the model has increased, as the model parameters are now updated once per batch (151 times per epoch) rather than once per epoch.
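The extra training time follows directly from the number of weight updates per epoch: mini-batch SGD performs one update per batch rather than one per epoch. A quick sanity check, assuming the training set size of 4,814 implied by the earlier full-batch runs:

```python
import math

n_train = 4814  # training set size (the full-batch runs above used batch_size = 4814)

for batch_size in (4814, 32, 64):
    steps = math.ceil(n_train / batch_size)  # weight updates per epoch
    print(f"batch_size={batch_size}: {steps} steps/epoch")
```

This reproduces the progress bars above: 1 step/epoch for full-batch, 151 for batch size 32, and 76 for batch size 64.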

Model 3¶

  • Let's now increase the batch size to 64 to see if the performance improves.
In [ ]:
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
In [ ]:
#Initializing the neural network
model = Sequential()
model.add(Dense(1,input_dim=x_train.shape[1]))
In [ ]:
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 1)              │           285 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 285 (1.11 KB)
 Trainable params: 285 (1.11 KB)
 Non-trainable params: 0 (0.00 B)
In [ ]:
optimizer = keras.optimizers.SGD()    # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics, run_eagerly=True)
In [ ]:
epochs = 25
batch_size = 64
In [ ]:
start = time.time()
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=batch_size, epochs=epochs)
end = time.time()
Epoch 1/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 27ms/step - loss: 111.9825 - r2_score: 0.3936 - val_loss: 38.6424 - val_r2_score: 0.7268
Epoch 2/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 26ms/step - loss: 36.7211 - r2_score: 0.7141 - val_loss: 35.5817 - val_r2_score: 0.7485
Epoch 3/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 26ms/step - loss: 34.1328 - r2_score: 0.7338 - val_loss: 34.1166 - val_r2_score: 0.7588
Epoch 4/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 31ms/step - loss: 32.7173 - r2_score: 0.7448 - val_loss: 33.0035 - val_r2_score: 0.7667
Epoch 5/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 28ms/step - loss: 31.6776 - r2_score: 0.7529 - val_loss: 32.1211 - val_r2_score: 0.7729
Epoch 6/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 24ms/step - loss: 30.8699 - r2_score: 0.7592 - val_loss: 31.4058 - val_r2_score: 0.7780
Epoch 7/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 26ms/step - loss: 30.2217 - r2_score: 0.7642 - val_loss: 30.8142 - val_r2_score: 0.7822
Epoch 8/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 26ms/step - loss: 29.6879 - r2_score: 0.7684 - val_loss: 30.3152 - val_r2_score: 0.7857
Epoch 9/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 28ms/step - loss: 29.2380 - r2_score: 0.7718 - val_loss: 29.8866 - val_r2_score: 0.7887
Epoch 10/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 32ms/step - loss: 28.8512 - r2_score: 0.7748 - val_loss: 29.5122 - val_r2_score: 0.7914
Epoch 11/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 26ms/step - loss: 28.5128 - r2_score: 0.7775 - val_loss: 29.1803 - val_r2_score: 0.7937
Epoch 12/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 26ms/step - loss: 28.2123 - r2_score: 0.7798 - val_loss: 28.8820 - val_r2_score: 0.7958
Epoch 13/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 26ms/step - loss: 27.9418 - r2_score: 0.7819 - val_loss: 28.6108 - val_r2_score: 0.7978
Epoch 14/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 26ms/step - loss: 27.6958 - r2_score: 0.7838 - val_loss: 28.3617 - val_r2_score: 0.7995
Epoch 15/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 32ms/step - loss: 27.4698 - r2_score: 0.7856 - val_loss: 28.1310 - val_r2_score: 0.8011
Epoch 16/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 25ms/step - loss: 27.2606 - r2_score: 0.7872 - val_loss: 27.9158 - val_r2_score: 0.8027
Epoch 17/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 27ms/step - loss: 27.0657 - r2_score: 0.7887 - val_loss: 27.7139 - val_r2_score: 0.8041
Epoch 18/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 26ms/step - loss: 26.8830 - r2_score: 0.7901 - val_loss: 27.5234 - val_r2_score: 0.8054
Epoch 19/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 26ms/step - loss: 26.7110 - r2_score: 0.7915 - val_loss: 27.3431 - val_r2_score: 0.8067
Epoch 20/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 36ms/step - loss: 26.5485 - r2_score: 0.7927 - val_loss: 27.1717 - val_r2_score: 0.8079
Epoch 21/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 34ms/step - loss: 26.3942 - r2_score: 0.7939 - val_loss: 27.0083 - val_r2_score: 0.8091
Epoch 22/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 26ms/step - loss: 26.2475 - r2_score: 0.7951 - val_loss: 26.8523 - val_r2_score: 0.8102
Epoch 23/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 46ms/step - loss: 26.1075 - r2_score: 0.7961 - val_loss: 26.7029 - val_r2_score: 0.8112
Epoch 24/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 24ms/step - loss: 25.9737 - r2_score: 0.7972 - val_loss: 26.5596 - val_r2_score: 0.8123
Epoch 25/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 24ms/step - loss: 25.8455 - r2_score: 0.7982 - val_loss: 26.4221 - val_r2_score: 0.8132
In [ ]:
print("Time taken in seconds ",end-start)
Time taken in seconds  66.82818222045898
In [ ]:
plot(history, 'loss')
[Plot: loss over epochs]
In [ ]:
plot(history, 'r2_score')
[Plot: R-squared over epochs]
In [ ]:
results.loc[3]=['-','-','-',epochs,batch_size,'SGD',(end-start),history.history["loss"][-1],history.history["val_loss"][-1],history.history["r2_score"][-1],history.history["val_r2_score"][-1]]
In [ ]:
results
Out[ ]:
# hidden layers # neurons - hidden layer activation function - hidden layer # epochs batch size optimizer time(secs) Train_loss Valid_loss Train_R-squared Valid_R-squared
0 - - - 10 4814 GD 2.855393 116.842415 127.022896 0.077489 0.102093
1 - - - 25 4814 GD 4.981976 62.466640 66.866333 0.506804 0.527331
2 - - - 25 32 SGD 116.037931 25.865023 24.276419 0.795787 0.828393
3 - - - 25 64 SGD 66.828182 27.897743 26.422087 0.779738 0.813226
  • The performance hasn't improved, but the training time has roughly halved (from ~116 to ~67 seconds).
  • There's always a tradeoff here: performance vs. computation time.

Model 4¶

  • Let's now add a hidden layer with 128 neurons.
  • We'll use sigmoid as the activation function.
In [ ]:
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
In [ ]:
#Initializing the neural network
model = Sequential()
model.add(Dense(128,activation="sigmoid",input_dim=x_train.shape[1]))
model.add(Dense(1))
In [ ]:
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 128)            │        36,480 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 1)              │           129 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 36,609 (143.00 KB)
 Trainable params: 36,609 (143.00 KB)
 Non-trainable params: 0 (0.00 B)
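As a sanity check, the parameter counts in the summary can be derived by hand: a `Dense` layer has `(inputs + 1) * units` parameters (weights plus one bias per unit). The input width of 284 is inferred from the single-unit model above, which had 285 parameters (284 weights + 1 bias):

```python
input_dim = 284   # inferred: the earlier Dense(1) model had 285 = 284 weights + 1 bias
hidden_units = 128

hidden_params = (input_dim + 1) * hidden_units  # weights + biases for the hidden layer
output_params = hidden_units + 1                # weights + bias for the output unit
print(hidden_params, output_params, hidden_params + output_params)
```

This matches the summary: 36,480 + 129 = 36,609 total parameters.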
In [ ]:
optimizer = keras.optimizers.SGD()    # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics, run_eagerly=True)
In [ ]:
epochs = 25
batch_size = 64
In [ ]:
start = time.time()
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=batch_size, epochs=epochs)
end = time.time()
Epoch 1/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 32ms/step - loss: 100.2605 - r2_score: 0.4408 - val_loss: 36.9329 - val_r2_score: 0.7389
Epoch 2/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 31ms/step - loss: 36.0253 - r2_score: 0.7199 - val_loss: 33.1919 - val_r2_score: 0.7654
Epoch 3/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 31.8483 - r2_score: 0.7519 - val_loss: 30.7305 - val_r2_score: 0.7828
Epoch 4/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 29ms/step - loss: 29.2322 - r2_score: 0.7721 - val_loss: 28.7463 - val_r2_score: 0.7968
Epoch 5/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 31ms/step - loss: 27.2142 - r2_score: 0.7876 - val_loss: 27.1094 - val_r2_score: 0.8084
Epoch 6/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 38ms/step - loss: 25.5121 - r2_score: 0.8008 - val_loss: 25.6944 - val_r2_score: 0.8184
Epoch 7/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 30ms/step - loss: 23.9985 - r2_score: 0.8125 - val_loss: 24.4229 - val_r2_score: 0.8274
Epoch 8/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 22.6255 - r2_score: 0.8231 - val_loss: 23.2575 - val_r2_score: 0.8356
Epoch 9/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 33ms/step - loss: 21.3786 - r2_score: 0.8328 - val_loss: 22.1796 - val_r2_score: 0.8432
Epoch 10/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 30ms/step - loss: 20.2531 - r2_score: 0.8415 - val_loss: 21.1771 - val_r2_score: 0.8503
Epoch 11/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - loss: 19.2443 - r2_score: 0.8494 - val_loss: 20.2422 - val_r2_score: 0.8569
Epoch 12/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 18.3453 - r2_score: 0.8563 - val_loss: 19.3709 - val_r2_score: 0.8631
Epoch 13/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 35ms/step - loss: 17.5467 - r2_score: 0.8626 - val_loss: 18.5624 - val_r2_score: 0.8688
Epoch 14/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 29ms/step - loss: 16.8377 - r2_score: 0.8681 - val_loss: 17.8170 - val_r2_score: 0.8741
Epoch 15/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 29ms/step - loss: 16.2078 - r2_score: 0.8730 - val_loss: 17.1348 - val_r2_score: 0.8789
Epoch 16/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 29ms/step - loss: 15.6470 - r2_score: 0.8774 - val_loss: 16.5148 - val_r2_score: 0.8833
Epoch 17/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 40ms/step - loss: 15.1466 - r2_score: 0.8813 - val_loss: 15.9552 - val_r2_score: 0.8872
Epoch 18/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 30ms/step - loss: 14.6988 - r2_score: 0.8848 - val_loss: 15.4533 - val_r2_score: 0.8908
Epoch 19/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 14.2969 - r2_score: 0.8879 - val_loss: 15.0056 - val_r2_score: 0.8939
Epoch 20/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 29ms/step - loss: 13.9348 - r2_score: 0.8908 - val_loss: 14.6080 - val_r2_score: 0.8967
Epoch 21/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 40ms/step - loss: 13.6071 - r2_score: 0.8934 - val_loss: 14.2554 - val_r2_score: 0.8992
Epoch 22/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 33ms/step - loss: 13.3093 - r2_score: 0.8957 - val_loss: 13.9424 - val_r2_score: 0.9014
Epoch 23/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 36ms/step - loss: 13.0374 - r2_score: 0.8979 - val_loss: 13.6638 - val_r2_score: 0.9034
Epoch 24/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 46ms/step - loss: 12.7880 - r2_score: 0.8998 - val_loss: 13.4147 - val_r2_score: 0.9052
Epoch 25/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 32ms/step - loss: 12.5583 - r2_score: 0.9016 - val_loss: 13.1910 - val_r2_score: 0.9068
In [ ]:
print("Time taken in seconds ",end-start)
Time taken in seconds  81.29296040534973
In [ ]:
plot(history, 'loss')
[Plot: loss over epochs]
In [ ]:
plot(history, 'r2_score')
[Plot: R-squared over epochs]
In [ ]:
results.loc[4]=[1,128,'sigmoid',epochs,batch_size,'SGD',(end-start),history.history["loss"][-1],history.history["val_loss"][-1],history.history["r2_score"][-1],history.history["val_r2_score"][-1]]
In [ ]:
results
Out[ ]:
# hidden layers # neurons - hidden layer activation function - hidden layer # epochs batch size optimizer time(secs) Train_loss Valid_loss Train_R-squared Valid_R-squared
0 - - - 10 4814 GD 2.855393 116.842415 127.022896 0.077489 0.102093
1 - - - 25 4814 GD 4.981976 62.466640 66.866333 0.506804 0.527331
2 - - - 25 32 SGD 116.037931 25.865023 24.276419 0.795787 0.828393
3 - - - 25 64 SGD 66.828182 27.897743 26.422087 0.779738 0.813226
4 1 128 sigmoid 25 64 SGD 81.292960 13.616706 13.191008 0.892491 0.906755
  • We see a clear improvement in model performance with the hidden layer.
  • The training time has not increased drastically either.

Model 5¶

  • We'll now change the activation for the hidden layer from sigmoid to tanh.
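One reason tanh often trains faster than sigmoid in this setting: its output is zero-centered, and its gradient peaks at 1.0 versus 0.25 for sigmoid, so SGD updates are less likely to shrink as they pass through the hidden layer. A small illustration:

```python
import numpy as np

z = np.linspace(-4, 4, 9)  # includes z = 0, where both gradients peak
sig = 1 / (1 + np.exp(-z))
tanh = np.tanh(z)

# derivatives: sigmoid'(z) = sig * (1 - sig), tanh'(z) = 1 - tanh^2
print("max sigmoid gradient:", (sig * (1 - sig)).max())  # 0.25
print("max tanh gradient:   ", (1 - tanh ** 2).max())    # 1.0
```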
In [ ]:
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
In [ ]:
#Initializing the neural network
model = Sequential()
model.add(Dense(128,activation="tanh",input_dim=x_train.shape[1]))
model.add(Dense(1))
In [ ]:
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 128)            │        36,480 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 1)              │           129 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 36,609 (143.00 KB)
 Trainable params: 36,609 (143.00 KB)
 Non-trainable params: 0 (0.00 B)
In [ ]:
optimizer = keras.optimizers.SGD()    # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics, run_eagerly=True)
In [ ]:
epochs = 25
batch_size = 64
In [ ]:
start = time.time()
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=batch_size, epochs=epochs)
end = time.time()
Epoch 1/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - loss: 79.5473 - r2_score: 0.5943 - val_loss: 35.1616 - val_r2_score: 0.7514
Epoch 2/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 29ms/step - loss: 30.2468 - r2_score: 0.7651 - val_loss: 27.9162 - val_r2_score: 0.8027
Epoch 3/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 34ms/step - loss: 25.7403 - r2_score: 0.7997 - val_loss: 23.4933 - val_r2_score: 0.8339
Epoch 4/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 31ms/step - loss: 22.1494 - r2_score: 0.8272 - val_loss: 21.4471 - val_r2_score: 0.8484
Epoch 5/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 31ms/step - loss: 19.6231 - r2_score: 0.8466 - val_loss: 19.7989 - val_r2_score: 0.8600
Epoch 6/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 36ms/step - loss: 17.8089 - r2_score: 0.8607 - val_loss: 18.0529 - val_r2_score: 0.8724
Epoch 7/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 31ms/step - loss: 16.4196 - r2_score: 0.8715 - val_loss: 16.4278 - val_r2_score: 0.8839
Epoch 8/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 31ms/step - loss: 15.3018 - r2_score: 0.8802 - val_loss: 15.0046 - val_r2_score: 0.8939
Epoch 9/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 14.3681 - r2_score: 0.8874 - val_loss: 13.8965 - val_r2_score: 0.9018
Epoch 10/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 13.5662 - r2_score: 0.8936 - val_loss: 13.0059 - val_r2_score: 0.9081
Epoch 11/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 40ms/step - loss: 12.8818 - r2_score: 0.8989 - val_loss: 12.3403 - val_r2_score: 0.9128
Epoch 12/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 29ms/step - loss: 12.2779 - r2_score: 0.9036 - val_loss: 11.8500 - val_r2_score: 0.9162
Epoch 13/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 28ms/step - loss: 11.7520 - r2_score: 0.9077 - val_loss: 11.4965 - val_r2_score: 0.9187
Epoch 14/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 11.2884 - r2_score: 0.9113 - val_loss: 11.2293 - val_r2_score: 0.9206
Epoch 15/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 40ms/step - loss: 10.8749 - r2_score: 0.9146 - val_loss: 11.0060 - val_r2_score: 0.9222
Epoch 16/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 30ms/step - loss: 10.5038 - r2_score: 0.9175 - val_loss: 10.8015 - val_r2_score: 0.9236
Epoch 17/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - loss: 10.1706 - r2_score: 0.9201 - val_loss: 10.6096 - val_r2_score: 0.9250
Epoch 18/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 31ms/step - loss: 9.8682 - r2_score: 0.9225 - val_loss: 10.4366 - val_r2_score: 0.9262
Epoch 19/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 39ms/step - loss: 9.5903 - r2_score: 0.9247 - val_loss: 10.2879 - val_r2_score: 0.9273
Epoch 20/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 29ms/step - loss: 9.3318 - r2_score: 0.9267 - val_loss: 10.1541 - val_r2_score: 0.9282
Epoch 21/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - loss: 9.0828 - r2_score: 0.9287 - val_loss: 10.0276 - val_r2_score: 0.9291
Epoch 22/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 29ms/step - loss: 8.8365 - r2_score: 0.9306 - val_loss: 9.9145 - val_r2_score: 0.9299
Epoch 23/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 40ms/step - loss: 8.5971 - r2_score: 0.9325 - val_loss: 9.8223 - val_r2_score: 0.9306
Epoch 24/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - loss: 8.3698 - r2_score: 0.9343 - val_loss: 9.7550 - val_r2_score: 0.9310
Epoch 25/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - loss: 8.1562 - r2_score: 0.9360 - val_loss: 9.7151 - val_r2_score: 0.9313
In [ ]:
print("Time taken in seconds ",end-start)
Time taken in seconds  77.24922800064087
In [ ]:
plot(history, 'loss')
[Plot: loss over epochs]
In [ ]:
plot(history, 'r2_score')
[Plot: R-squared over epochs]
In [ ]:
results.loc[5]=[1,128,'tanh',epochs,batch_size,'SGD',(end-start),history.history["loss"][-1],history.history["val_loss"][-1],history.history["r2_score"][-1],history.history["val_r2_score"][-1]]
In [ ]:
results
Out[ ]:
# hidden layers # neurons - hidden layer activation function - hidden layer # epochs batch size optimizer time(secs) Train_loss Valid_loss Train_R-squared Valid_R-squared
0 - - - 10 4814 GD 2.855393 116.842415 127.022896 0.077489 0.102093
1 - - - 25 4814 GD 4.981976 62.466640 66.866333 0.506804 0.527331
2 - - - 25 32 SGD 116.037931 25.865023 24.276419 0.795787 0.828393
3 - - - 25 64 SGD 66.828182 27.897743 26.422087 0.779738 0.813226
4 1 128 sigmoid 25 64 SGD 81.292960 13.616706 13.191008 0.892491 0.906755
5 1 128 tanh 25 64 SGD 77.249228 8.859550 9.715087 0.930051 0.931325
  • Changing the activation from sigmoid to tanh has further improved the $R^2$.

Model 6¶

  • We'll now change the activation for the hidden layer from tanh to relu.
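For reference, ReLU simply zeroes out negative pre-activations. It is cheap to compute, and its gradient is exactly 1 for active units; the flip side is that units whose input stays negative receive no gradient at all (the "dying ReLU" issue). A minimal sketch:

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
relu = np.maximum(0.0, z)          # negatives clamped to zero, positives pass through
grad = (z > 0).astype(float)       # gradient: 1 where the unit is active, 0 otherwise

print(relu)
print(grad)
```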
In [ ]:
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
In [ ]:
#Initializing the neural network
model = Sequential()
model.add(Dense(128,activation="relu",input_dim=x_train.shape[1]))
model.add(Dense(1))
In [ ]:
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 128)            │        36,480 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 1)              │           129 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 36,609 (143.00 KB)
 Trainable params: 36,609 (143.00 KB)
 Non-trainable params: 0 (0.00 B)
In [ ]:
optimizer = keras.optimizers.SGD()    # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics, run_eagerly=True)
In [ ]:
epochs = 25
batch_size = 64
In [ ]:
start = time.time()
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=batch_size, epochs=epochs)
end = time.time()
Epoch 1/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 35ms/step - loss: 78.1409 - r2_score: 0.6149 - val_loss: 22.6298 - val_r2_score: 0.8400
Epoch 2/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 31ms/step - loss: 24.4104 - r2_score: 0.8088 - val_loss: 20.3330 - val_r2_score: 0.8563
Epoch 3/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 31ms/step - loss: 20.8962 - r2_score: 0.8361 - val_loss: 20.2712 - val_r2_score: 0.8567
Epoch 4/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 31ms/step - loss: 18.9292 - r2_score: 0.8516 - val_loss: 19.0919 - val_r2_score: 0.8650
Epoch 5/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 35ms/step - loss: 17.5839 - r2_score: 0.8622 - val_loss: 18.7435 - val_r2_score: 0.8675
Epoch 6/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 34ms/step - loss: 16.6699 - r2_score: 0.8696 - val_loss: 18.1998 - val_r2_score: 0.8713
Epoch 7/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - loss: 15.9100 - r2_score: 0.8757 - val_loss: 18.0499 - val_r2_score: 0.8724
Epoch 8/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 15.2037 - r2_score: 0.8814 - val_loss: 17.2522 - val_r2_score: 0.8780
Epoch 9/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 29ms/step - loss: 14.6969 - r2_score: 0.8855 - val_loss: 16.7128 - val_r2_score: 0.8819
Epoch 10/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 39ms/step - loss: 14.1779 - r2_score: 0.8896 - val_loss: 16.0255 - val_r2_score: 0.8867
Epoch 11/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 30ms/step - loss: 13.7966 - r2_score: 0.8926 - val_loss: 16.2979 - val_r2_score: 0.8848
Epoch 12/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 28ms/step - loss: 13.3378 - r2_score: 0.8964 - val_loss: 15.2482 - val_r2_score: 0.8922
Epoch 13/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 28ms/step - loss: 13.0258 - r2_score: 0.8988 - val_loss: 15.3594 - val_r2_score: 0.8914
Epoch 14/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 39ms/step - loss: 12.6184 - r2_score: 0.9021 - val_loss: 14.2531 - val_r2_score: 0.8992
Epoch 15/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 30ms/step - loss: 12.3351 - r2_score: 0.9042 - val_loss: 14.5103 - val_r2_score: 0.8974
Epoch 16/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 11.9488 - r2_score: 0.9074 - val_loss: 14.1340 - val_r2_score: 0.9001
Epoch 17/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 11.7498 - r2_score: 0.9089 - val_loss: 15.1942 - val_r2_score: 0.8926
Epoch 18/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 39ms/step - loss: 11.6430 - r2_score: 0.9100 - val_loss: 14.6469 - val_r2_score: 0.8965
Epoch 19/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - loss: 11.2995 - r2_score: 0.9126 - val_loss: 14.3911 - val_r2_score: 0.8983
Epoch 20/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 29ms/step - loss: 10.9995 - r2_score: 0.9149 - val_loss: 15.4576 - val_r2_score: 0.8907
Epoch 21/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 29ms/step - loss: 11.0765 - r2_score: 0.9145 - val_loss: 13.6486 - val_r2_score: 0.9035
Epoch 22/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 29ms/step - loss: 10.4216 - r2_score: 0.9194 - val_loss: 14.7097 - val_r2_score: 0.8960
Epoch 23/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 39ms/step - loss: 10.3984 - r2_score: 0.9196 - val_loss: 14.1583 - val_r2_score: 0.8999
Epoch 24/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 31ms/step - loss: 10.1541 - r2_score: 0.9215 - val_loss: 14.2825 - val_r2_score: 0.8990
Epoch 25/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 28ms/step - loss: 9.9332 - r2_score: 0.9232 - val_loss: 14.3455 - val_r2_score: 0.8986
In [ ]:
print("Time taken in seconds ",end-start)
Time taken in seconds  73.81564784049988
In [ ]:
plot(history, 'loss')
[Plot: loss over epochs]
In [ ]:
plot(history, 'r2_score')
[Plot: R-squared over epochs]
In [ ]:
results.loc[6]=[1,128,'relu',epochs,batch_size,'SGD',(end-start),history.history["loss"][-1],history.history["val_loss"][-1],history.history["r2_score"][-1],history.history["val_r2_score"][-1]]
In [ ]:
results
Out[ ]:
# hidden layers # neurons - hidden layer activation function - hidden layer # epochs batch size optimizer time(secs) Train_loss Valid_loss Train_R-squared Valid_R-squared
0 - - - 10 4814 GD 2.855393 116.842415 127.022896 0.077489 0.102093
1 - - - 25 4814 GD 4.981976 62.466640 66.866333 0.506804 0.527331
2 - - - 25 32 SGD 116.037931 25.865023 24.276419 0.795787 0.828393
3 - - - 25 64 SGD 66.828182 27.897743 26.422087 0.779738 0.813226
4 1 128 sigmoid 25 64 SGD 81.292960 13.616706 13.191008 0.892491 0.906755
5 1 128 tanh 25 64 SGD 77.249228 8.859550 9.715087 0.930051 0.931325
6 1 128 relu 25 64 SGD 73.815648 9.730357 14.345462 0.923175 0.898594
  • We don't see much improvement; the validation $R^2$ is actually slightly lower than with tanh, and the train-validation gap has widened.

Model 7¶

  • We will now add one more hidden layer with 32 neurons.
  • We'll use relu activation in both hidden layers.
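Stacking layers compounds the parameter count layer by layer. Applying the same `(inputs + 1) * units` rule to the 284 → 128 → 32 → 1 architecture (input width inferred from the earlier summaries) gives:

```python
dims = [284, 128, 32, 1]  # inferred input width, then the two hidden layers and the output

layer_params = [(dims[i] + 1) * dims[i + 1] for i in range(len(dims) - 1)]
print(layer_params, "total:", sum(layer_params))
```

This matches the summary below: 36,480 + 4,128 + 33 = 40,641 parameters.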
In [ ]:
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
In [ ]:
#Initializing the neural network
model = Sequential()
model.add(Dense(128,activation="relu",input_dim=x_train.shape[1]))
model.add(Dense(32,activation="relu"))
model.add(Dense(1))
In [ ]:
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 128)            │        36,480 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 32)             │         4,128 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 1)              │            33 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 40,641 (158.75 KB)
 Trainable params: 40,641 (158.75 KB)
 Non-trainable params: 0 (0.00 B)
In [ ]:
optimizer = keras.optimizers.SGD()    # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics, run_eagerly=True)
In [ ]:
epochs = 25
batch_size = 64
In [ ]:
start = time.time()
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=batch_size, epochs=epochs)
end = time.time()
Epoch 1/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 39ms/step - loss: 119.0316 - r2_score: 0.3395 - val_loss: 174.7455 - val_r2_score: -0.2353
Epoch 2/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 34ms/step - loss: 141.1910 - r2_score: -0.0877 - val_loss: 125.4722 - val_r2_score: 0.1131
Epoch 3/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 6s 44ms/step - loss: 103.6209 - r2_score: 0.2028 - val_loss: 92.8917 - val_r2_score: 0.3434
Epoch 4/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 44ms/step - loss: 80.4926 - r2_score: 0.3807 - val_loss: 68.0757 - val_r2_score: 0.5188
Epoch 5/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 34ms/step - loss: 61.7936 - r2_score: 0.5238 - val_loss: 71.7318 - val_r2_score: 0.4929
Epoch 6/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 32ms/step - loss: 74.3823 - r2_score: 0.4244 - val_loss: 67.4626 - val_r2_score: 0.5231
Epoch 7/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 38ms/step - loss: 71.3977 - r2_score: 0.4470 - val_loss: 65.2459 - val_r2_score: 0.5388
Epoch 8/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 34ms/step - loss: 72.4106 - r2_score: 0.4396 - val_loss: 140.3398 - val_r2_score: 0.0080
Epoch 9/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 32ms/step - loss: 69.9716 - r2_score: 0.4682 - val_loss: 66.5744 - val_r2_score: 0.5294
Epoch 10/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 40ms/step - loss: 68.2321 - r2_score: 0.4719 - val_loss: 56.5915 - val_r2_score: 0.6000
Epoch 11/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 35ms/step - loss: 58.4738 - r2_score: 0.5476 - val_loss: 49.1373 - val_r2_score: 0.6527
Epoch 12/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 34ms/step - loss: 50.7393 - r2_score: 0.6077 - val_loss: 43.1657 - val_r2_score: 0.6949
Epoch 13/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 6s 44ms/step - loss: 45.3931 - r2_score: 0.6491 - val_loss: 40.0219 - val_r2_score: 0.7171
Epoch 14/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 33ms/step - loss: 41.0568 - r2_score: 0.6832 - val_loss: 35.5701 - val_r2_score: 0.7486
Epoch 15/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 34ms/step - loss: 37.9652 - r2_score: 0.7067 - val_loss: 32.7239 - val_r2_score: 0.7687
Epoch 16/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 35ms/step - loss: 35.4936 - r2_score: 0.7256 - val_loss: 30.1875 - val_r2_score: 0.7866
Epoch 17/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 36ms/step - loss: 33.8804 - r2_score: 0.7376 - val_loss: 27.9216 - val_r2_score: 0.8026
Epoch 18/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 33ms/step - loss: 31.1315 - r2_score: 0.7591 - val_loss: 26.1075 - val_r2_score: 0.8154
Epoch 19/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 35ms/step - loss: 29.7628 - r2_score: 0.7695 - val_loss: 27.1189 - val_r2_score: 0.8083
Epoch 20/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 34ms/step - loss: 30.3165 - r2_score: 0.7646 - val_loss: 23.1413 - val_r2_score: 0.8364
Epoch 21/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 34ms/step - loss: 28.9596 - r2_score: 0.7751 - val_loss: 32.8746 - val_r2_score: 0.7676
Epoch 22/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 6s 44ms/step - loss: 27.2722 - r2_score: 0.7889 - val_loss: 21.7117 - val_r2_score: 0.8465
Epoch 23/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 33ms/step - loss: 27.3585 - r2_score: 0.7874 - val_loss: 20.9258 - val_r2_score: 0.8521
Epoch 24/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 34ms/step - loss: 25.7879 - r2_score: 0.7994 - val_loss: 20.4467 - val_r2_score: 0.8555
Epoch 25/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 6s 45ms/step - loss: 23.9112 - r2_score: 0.8142 - val_loss: 19.0233 - val_r2_score: 0.8655
In [ ]:
print("Time taken in seconds ",end-start)
Time taken in seconds  104.9405345916748
In [ ]:
plot(history, 'loss')
[Plot: loss over epochs]
In [ ]:
plot(history, 'r2_score')
[Plot: R-squared over epochs]
In [ ]:
results.loc[7]=[2,[128,32],['relu','relu'],epochs,batch_size,'SGD',(end-start),history.history["loss"][-1],history.history["val_loss"][-1],history.history["r2_score"][-1],history.history["val_r2_score"][-1]]
In [ ]:
results
Out[ ]:
|   | # hidden layers | # neurons (hidden layer) | activation (hidden layer) | # epochs | batch size | optimizer | time (secs) | Train_loss | Valid_loss | Train_R-squared | Valid_R-squared |
|---|-----------------|--------------------------|---------------------------|----------|------------|-----------|-------------|------------|------------|-----------------|-----------------|
| 0 | - | - | - | 10 | 4814 | GD | 2.855393 | 116.842415 | 127.022896 | 0.077489 | 0.102093 |
| 1 | - | - | - | 25 | 4814 | GD | 4.981976 | 62.466640 | 66.866333 | 0.506804 | 0.527331 |
| 2 | - | - | - | 25 | 32 | SGD | 116.037931 | 25.865023 | 24.276419 | 0.795787 | 0.828393 |
| 3 | - | - | - | 25 | 64 | SGD | 66.828182 | 27.897743 | 26.422087 | 0.779738 | 0.813226 |
| 4 | 1 | 128 | sigmoid | 25 | 64 | SGD | 81.292960 | 13.616706 | 13.191008 | 0.892491 | 0.906755 |
| 5 | 1 | 128 | tanh | 25 | 64 | SGD | 77.249228 | 8.859550 | 9.715087 | 0.930051 | 0.931325 |
| 6 | 1 | 128 | relu | 25 | 64 | SGD | 73.815648 | 9.730357 | 14.345462 | 0.923175 | 0.898594 |
| 7 | 2 | [128, 32] | [relu, relu] | 25 | 64 | SGD | 104.940535 | 25.567276 | 19.023302 | 0.798137 | 0.865527 |
  • Adding a second hidden layer did not improve performance: both the training and validation R² dropped relative to the single-hidden-layer models.

Model Performance Comparison and Final Model Selection¶

In [ ]:
results
Out[ ]:
|   | # hidden layers | # neurons (hidden layer) | activation (hidden layer) | # epochs | batch size | optimizer | time (secs) | Train_loss | Valid_loss | Train_R-squared | Valid_R-squared |
|---|-----------------|--------------------------|---------------------------|----------|------------|-----------|-------------|------------|------------|-----------------|-----------------|
| 0 | - | - | - | 10 | 4814 | GD | 2.855393 | 116.842415 | 127.022896 | 0.077489 | 0.102093 |
| 1 | - | - | - | 25 | 4814 | GD | 4.981976 | 62.466640 | 66.866333 | 0.506804 | 0.527331 |
| 2 | - | - | - | 25 | 32 | SGD | 116.037931 | 25.865023 | 24.276419 | 0.795787 | 0.828393 |
| 3 | - | - | - | 25 | 64 | SGD | 66.828182 | 27.897743 | 26.422087 | 0.779738 | 0.813226 |
| 4 | 1 | 128 | sigmoid | 25 | 64 | SGD | 81.292960 | 13.616706 | 13.191008 | 0.892491 | 0.906755 |
| 5 | 1 | 128 | tanh | 25 | 64 | SGD | 77.249228 | 8.859550 | 9.715087 | 0.930051 | 0.931325 |
| 6 | 1 | 128 | relu | 25 | 64 | SGD | 73.815648 | 9.730357 | 14.345462 | 0.923175 | 0.898594 |
| 7 | 2 | [128, 32] | [relu, relu] | 25 | 64 | SGD | 104.940535 | 25.567276 | 19.023302 | 0.798137 | 0.865527 |
  • Models 5 (tanh) and 6 (ReLU) achieved the highest training and validation scores among all the models.

  • We could choose either one. Let's choose Model 6: its validation score sits slightly below its training score, which suggests a realistic estimate of generalization performance.

  • We'll go ahead with this model as our final model.

  • Let's rebuild it and check its performance across multiple metrics

Final Model¶

In [ ]:
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
In [ ]:
#Initializing the neural network
model = Sequential()
model.add(Dense(128,activation="relu",input_dim=x_train.shape[1]))
model.add(Dense(1))
In [ ]:
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 128)            │        36,480 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 1)              │           129 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 36,609 (143.00 KB)
 Trainable params: 36,609 (143.00 KB)
 Non-trainable params: 0 (0.00 B)
In [ ]:
optimizer = keras.optimizers.SGD()    # defining SGD as the optimizer to be used
model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=metrics,run_eagerly=True)
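The `metrics` list passed to `model.compile` is defined earlier in the notebook, presumably as something like `[keras.metrics.R2Score(name="r2_score")]`. What that metric computes is the coefficient of determination; a plain NumPy sketch of the same calculation:

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot
```

A value of 1.0 means perfect prediction, while 0.0 means the model does no better than predicting the mean of the targets.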
In [ ]:
epochs = 25
batch_size = 64
In [ ]:
history = model.fit(x_train, y_train, validation_data=(x_test,y_test) , batch_size=batch_size, epochs=epochs)
Epoch 1/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 31ms/step - loss: 78.9293 - r2_score: 0.5931 - val_loss: 15.2803 - val_r2_score: 0.8406
Epoch 2/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - loss: 24.1295 - r2_score: 0.8110 - val_loss: 13.2398 - val_r2_score: 0.8619
Epoch 3/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 34ms/step - loss: 20.7745 - r2_score: 0.8370 - val_loss: 12.4987 - val_r2_score: 0.8696
Epoch 4/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 30ms/step - loss: 18.8243 - r2_score: 0.8524 - val_loss: 12.0415 - val_r2_score: 0.8744
Epoch 5/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 17.5857 - r2_score: 0.8623 - val_loss: 11.4448 - val_r2_score: 0.8806
Epoch 6/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 28ms/step - loss: 16.5880 - r2_score: 0.8702 - val_loss: 11.1857 - val_r2_score: 0.8833
Epoch 7/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 35ms/step - loss: 15.7657 - r2_score: 0.8769 - val_loss: 10.9638 - val_r2_score: 0.8856
Epoch 8/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 5s 29ms/step - loss: 15.0834 - r2_score: 0.8824 - val_loss: 10.9710 - val_r2_score: 0.8855
Epoch 9/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 29ms/step - loss: 14.5815 - r2_score: 0.8865 - val_loss: 10.3246 - val_r2_score: 0.8923
Epoch 10/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 31ms/step - loss: 14.1321 - r2_score: 0.8901 - val_loss: 10.7137 - val_r2_score: 0.8882
Epoch 11/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 38ms/step - loss: 13.6907 - r2_score: 0.8937 - val_loss: 10.4253 - val_r2_score: 0.8912
Epoch 12/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - loss: 13.3379 - r2_score: 0.8965 - val_loss: 9.9991 - val_r2_score: 0.8957
Epoch 13/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 12.9711 - r2_score: 0.8994 - val_loss: 9.8395 - val_r2_score: 0.8973
Epoch 14/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 30ms/step - loss: 12.6858 - r2_score: 0.9017 - val_loss: 9.5879 - val_r2_score: 0.9000
Epoch 15/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 40ms/step - loss: 12.3916 - r2_score: 0.9040 - val_loss: 9.1747 - val_r2_score: 0.9043
Epoch 16/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - loss: 11.9349 - r2_score: 0.9075 - val_loss: 8.9432 - val_r2_score: 0.9067
Epoch 17/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 28ms/step - loss: 11.7616 - r2_score: 0.9089 - val_loss: 8.7636 - val_r2_score: 0.9086
Epoch 18/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 28ms/step - loss: 11.3985 - r2_score: 0.9117 - val_loss: 8.9226 - val_r2_score: 0.9069
Epoch 19/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 28ms/step - loss: 11.2242 - r2_score: 0.9131 - val_loss: 8.5445 - val_r2_score: 0.9108
Epoch 20/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 3s 41ms/step - loss: 10.9598 - r2_score: 0.9151 - val_loss: 8.6441 - val_r2_score: 0.9098
Epoch 21/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 30ms/step - loss: 10.7909 - r2_score: 0.9165 - val_loss: 8.3364 - val_r2_score: 0.9130
Epoch 22/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 28ms/step - loss: 10.5877 - r2_score: 0.9180 - val_loss: 7.8640 - val_r2_score: 0.9179
Epoch 23/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 2s 28ms/step - loss: 10.2563 - r2_score: 0.9205 - val_loss: 7.8603 - val_r2_score: 0.9180
Epoch 24/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 41ms/step - loss: 10.1446 - r2_score: 0.9213 - val_loss: 7.5617 - val_r2_score: 0.9211
Epoch 25/25
76/76 ━━━━━━━━━━━━━━━━━━━━ 4s 29ms/step - loss: 9.7241 - r2_score: 0.9246 - val_loss: 7.7915 - val_r2_score: 0.9187
In [ ]:
train_perf = model_performance(model,x_train,y_train)
print("Train performance")
pd.DataFrame(train_perf)
151/151 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step
Train performance
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 3.305341 1.563728 0.913741 0.908332 19.067221
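`model_performance` is a helper defined earlier in the notebook; a hedged reimplementation of what it presumably computes (RMSE, MAE, R², adjusted R², and MAPE), framed over raw arrays so it is framework-agnostic (in the notebook the predictions would come from `model.predict(x)`):

```python
import numpy as np

def model_performance_sketch(y_true, y_pred, n_features):
    """Regression metrics; a sketch of the notebook's model_performance helper."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float).ravel()  # flatten Keras's (n, 1) output
    n = len(y_true)
    resid = y_true - y_pred
    rmse = np.sqrt(np.mean(resid ** 2))
    mae = np.mean(np.abs(resid))
    r2 = 1 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)  # penalizes extra predictors
    mape = np.mean(np.abs(resid / y_true)) * 100            # assumes y_true has no zeros
    return {"RMSE": [rmse], "MAE": [mae], "R-squared": [r2],
            "Adj. R-squared": [adj_r2], "MAPE": [mape]}
```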
In [ ]:
x_val.isnull().sum()
Out[ ]:
0
Kilometers_Driven 0
Seats 0
New_Price 0
mileage_num 0
engine_num 0
... ...
Model_xylo 0
Model_yeti 0
Model_z4 0
Model_zen 0
Model_zest 0

284 rows × 1 columns


In [ ]:
y_val.isnull().sum()
Out[ ]:
0
In [ ]:
valid_perf = model_performance(model,x_val,y_val)
print("Validation data performance")
pd.DataFrame(valid_perf)
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
Validation data performance
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 3.588983 1.87963 0.908947 0.827374 19.785901
In [ ]:
test_perf = model_performance(model,x_test,y_test)
print("Test performance")
pd.DataFrame(test_perf)
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
Test performance
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 2.791322 1.4971 0.918701 0.845865 18.660195
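Note the gap between R² (~0.92) and adjusted R² (~0.85) on the test set: the adjustment penalizes the number of predictors, which is large here (284 one-hot encoded columns) relative to the few hundred test observations. The formula is

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
```

where $n$ is the number of observations and $p$ the number of predictors.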
  • The model has an $R^2$ of ~0.92 on the test set, which means it can explain ~92% of the variance in the unseen data (the adjusted $R^2$ of ~0.85 accounts for the large number of predictors)

  • The RMSE value is ~2.8, which means the model's price predictions are off by about 2.8 lakh INR in root-mean-squared terms

  • The MAE value is ~1.5 and the MAPE value is ~18.7, which means the model's predictions deviate from the actual price by about 18.7% on average

Business Insights and Recommendations¶

  1. Our neural network model explains approximately 92% of the variation in the unseen test data ($R^2$ ≈ 0.92).
  2. Our analysis has revealed that certain factors, such as the year of manufacture, the number of seats, and the maximum power of the engine, tend to increase the price of a used car. Conversely, factors like the distance traveled and engine volume tend to decrease the price of a used car.
  3. Certain markets tend to have higher prices, and it would be beneficial for Cars4U to focus on these markets and establish offices in these areas if necessary.
  4. We need to gather data on the cost side of things before discussing profitability in the business.
  5. After analyzing the data, the next step would be to cluster the different data sets and determine whether we should create multiple models for different locations or car types.

Appendix: Detailed Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

Kilometers_Driven¶

In [ ]:
histogram_boxplot(df1, "Kilometers_Driven", bins=100, kde=True)
No description has been provided for this image

Observations

  • This is another highly skewed distribution.
  • Let us use log transformation on this column too.
In [ ]:
df1["kilometers_driven_log"] = np.log(df1["Kilometers_Driven"])
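One caveat with `np.log` here: it requires strictly positive values and returns `-inf` for zeros. If any car had `Kilometers_Driven == 0`, `np.log1p` (which computes `log(1 + x)`) would be the safer variant:

```python
import numpy as np

km = np.array([0, 100, 50000, 775000])

# np.log(0) would be -inf, which breaks downstream modeling;
# np.log1p is 0 at x = 0 and nearly identical to log(x) for large x
print(np.log1p(km))
```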
In [ ]:
histogram_boxplot(df1, "kilometers_driven_log", bins=100, kde=True)
No description has been provided for this image
  • Transformation has reduced the extreme skewness.

mileage_num¶

In [ ]:
histogram_boxplot(df1, "mileage_num", kde=True)
No description has been provided for this image

Observations

  • This attribute is close to normally distributed.

engine_num¶

In [ ]:
histogram_boxplot(df1, "engine_num", kde=True)
No description has been provided for this image

Observations

  • There are a few cars with higher engine displacement volumes.

power_num¶

In [ ]:
histogram_boxplot(df1, "power_num", kde=True)
No description has been provided for this image

Observations

  • There are a few cars with higher engine power.
In [ ]:
# creating histograms
df.hist(figsize=(14, 14))
plt.show()
No description has been provided for this image
  • Price: The price of a used car is the target variable and has a highly skewed distribution, with a median value of around 53.5 lakh INR. The log transformation was applied on this column to reduce skewness. The displacement volume of the engine, the maximum power of the engine, and the price of a new car of the same model are highly correlated with the price of a used car.
  • Mileage: This attribute is close to normally distributed. As mileage increases, the engine displacement and power decrease.
  • Engine: There are a few upper outliers, indicating that there are a few cars with higher engine displacement volumes. Higher-priced cars have higher engine displacement. It is also highly correlated with the maximum engine power.
  • Power: There are a few upper outliers, indicating that there are a few cars with higher maximum power. Higher-priced cars have higher maximum power. It is also highly correlated with the engine displacement volume.
  • Kilometers_driven: The number of kilometers a used car is driven has a highly skewed distribution, with a median value of around 53.5 thousand. The log transformation was applied on this column to reduce skewness.
  • New_Price: The price of a new car of the same model has a highly skewed distribution, with a median value of around 11.3 lakh INR. The log transformation was applied on this column to reduce skewness.
  • Seats: 84% of the cars in the dataset are 5-seater cars.
  • Year: More than half the cars in the data were manufactured in or after 2014. The price of used cars has increased over the years.
  • Brand: Most of the cars in the data belong to Maruti or Hyundai. The price of used cars is lower for budget brands like Maruti, Tata, Fiat, etc. The price of used cars is higher for premium brands like Porsche, Bentley, Lamborghini, etc.
  • Model: Maruti Swift is the most common car up for resale. The dataset contains used cars from luxury as well as budget-friendly brands.
  • Location: Hyderabad and Mumbai have the most demand for used cars. The price of used cars has a large IQR in Coimbatore and Bangalore.
  • Fuel_Type: Around 1% of the cars in the dataset do not run on diesel or petrol. Electric cars have the highest median price, followed by diesel cars.
  • Transmission: More than 70% of the cars have manual transmission. The price is higher for used cars with automatic transmission.
  • Owner_Type: More than 80% of the used cars are being sold for the first time. The price of cars decreases as they keep getting resold.

Model¶

In [ ]:
labeled_barplot(df1, "Model", perc=True, n=10)
No description has been provided for this image

Observations

  • Maruti Swift is the most common car up for resale.

  • It is clear from the above charts that our dataset contains used cars from luxury as well as budget-friendly brands.

  • We can create a new variable using this information. We can consider binning all our cars into the following 3 categories later:

    1. Budget-Friendly
    2. Mid Range
    3. Luxury Cars
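The three-way binning suggested above could be implemented with `pd.cut`; a sketch with hypothetical price thresholds (in lakh INR) that would need tuning against the actual price distribution:

```python
import pandas as pd

# Toy prices in lakh INR; in the notebook this would be df1["Price"]
prices = pd.Series([2.5, 5.0, 9.5, 25.0, 75.0], name="Price")

# Bin edges are illustrative assumptions, not derived from the dataset
car_category = pd.cut(prices, bins=[0, 5, 20, float("inf")],
                      labels=["Budget-Friendly", "Mid Range", "Luxury Cars"])
print(car_category.tolist())
```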

Seats¶

In [ ]:
labeled_barplot(df1, "Seats", perc=True)
No description has been provided for this image
  • 84% of the cars in the dataset are 5-seater cars.

Year¶

In [ ]:
labeled_barplot(df1, "Year", perc=True)
No description has been provided for this image
  • More than half the cars in the data were manufactured in or after 2014.

Transmission¶

In [ ]:
labeled_barplot(df1, "Transmission", perc=True)
No description has been provided for this image
  • More than 70% of the cars have manual transmission.

Owner_Type¶

In [ ]:
labeled_barplot(df1, "Owner_Type", perc=True)
No description has been provided for this image
  • More than 80% of the used cars are being sold for the first time.

Bivariate Analysis¶

Let's check the variation in Price with some of the other variables.

Price vs Transmission¶

In [ ]:
plt.figure(figsize=(5, 5))
sns.boxplot(x="Transmission", y="Price", data=df1)
plt.show()
No description has been provided for this image
  • The price is higher for used cars with automatic transmission.

Price vs Fuel_Type¶

In [ ]:
plt.figure(figsize=(18, 5))
sns.boxplot(x="Fuel_Type", y="Price", data=df1)
plt.show()
No description has been provided for this image
  • Electric cars have the highest median price, followed by diesel cars.

Price vs Brand¶

In [ ]:
plt.figure(figsize=(18, 5))
sns.boxplot(x="Brand", y="Price", data=df1)
plt.xticks(rotation=90)
plt.show()
No description has been provided for this image
  • The price of used cars is lower for budget brands like Maruti, Tata, Fiat, etc.
  • The price of used cars is higher for premium brands like Porsche, Audi, Lamborghini, etc.

Price vs Owner_Type¶

In [ ]:
plt.figure(figsize=(18, 5))
sns.boxplot(x="Owner_Type", y="Price", data=df1)
plt.show()
No description has been provided for this image
  • The price of cars decreases as they keep getting resold.

Pairplot for relations between numerical variables¶

In [ ]:
sns.pairplot(data=df1, hue="Fuel_Type")
plt.show()
No description has been provided for this image

Zooming into these plots gives us a lot of information.

  • Contrary to intuition, Kilometers_Driven does not seem to have a relationship with the price.

  • Price has a positive relationship with Year, i.e., the newer the car, the higher the price.

    • The temporal element of variation is captured in the year column.
  • 2 seater cars are all luxury variants. Cars with 8-10 seats are exclusively mid to high range.

  • Mileage does not seem to show much relationship with the price of used cars.

  • Engine displacement and power of the car have a positive relationship with the price.

  • New_Price and used car price are also positively correlated, which is expected.

  • Kilometers_Driven has a peculiar relationship with the Year variable. Generally, the newer the car, the lesser the distance it has traveled, but this is not always true.

  • CNG cars are conspicuous outliers when it comes to Mileage. The mileage of these cars is very high.

  • The mileage and power of newer cars are increasing owing to advancements in technology.

  • Mileage has a negative correlation with engine displacement and power. The more powerful the engine, the more fuel it consumes in general.
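The correlation claims above can be checked numerically with a Pearson correlation matrix over the numeric columns. A sketch on toy data standing in for `df1` (in the notebook this would be `df1[num_cols].corr()` over the actual engineered columns):

```python
import pandas as pd

# Toy stand-in for df1's numeric columns; values are illustrative only
df1 = pd.DataFrame({
    "Price":       [3.5, 5.0, 12.0, 35.0],
    "power_num":   [74, 88, 140, 395],
    "engine_num":  [1197, 1248, 1968, 3982],
    "mileage_num": [23.1, 20.5, 15.2, 9.8],
})

# Pearson correlation of each numeric attribute with Price
print(df1.corr()["Price"].sort_values(ascending=False))
```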

To jump back to the EDA summary section, click here.